What is robots.txt file and how to use it?

Updated: September 22, 2020

The world of search engine optimisation is intricate and complex, but we like to think it’s a world full of digital marketing phenomena that we enjoy using each and every day. One small but vital part of what we do is working with the tiny, and often tricky, robots.txt file. Getting it right is quite a feat, and draws upon all the creative, strategic, and technical forces that make for successful, high-quality SEO. To help our clients understand why we use robots.txt, and how to inject efficiency using it, we’ve put together a concise guide on everything you need to know about the robots.txt file, and the role it plays in your digital world. Read up and get familiar with robots.txt and its power over your digital domain!

What is the robots.txt file?

Getting to the bottom of a robots.txt file starts with understanding Google as a search engine; more specifically, what Google does in order to determine where your website sits in search rankings. Imagine a whole army of invisible henchmen spreading out across the internet, on a journey to search each and every web page in existence.

In Google’s case, this crawler is called Googlebot, identified by its user agent. Ultimately, these little bots are programmed spiders that dive into a website and look around to see how it can be indexed in search results. Think of them as troops that bring information back to search engines, allowing them to determine where to rank pages.

Your pages might hold all kinds of information, and it can be hard for search engines to determine where to place or rank them in results. Googlebot identifies content on your pages – from products through to photos, contact forms, and “back-end” files or folders – to assess what you’re all about. Those back-end files and folders are likely things you don’t want visible in search results, or anywhere in the public realm. That’s where robots.txt comes in: a Disallow rule tells search engines that they don’t need those particular pages.

That could be because they’re irrelevant to your audience, purely technical, or even hold sensitive information that would put your business at a disadvantage if made public. Unless you specifically restrict access through Disallow rules in robots.txt, search engines will try to crawl those pages through their user agents.

Simply put, you need to use robots.txt to tell search engines when they AREN’T ALLOWED to use those pages to rank your website. Think of it as a robots exclusion rule that says “Hey, Google, don’t look at this page – we don’t need you to.”
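
As a minimal sketch of what that rule looks like in practice (the /admin/ folder here is just a hypothetical placeholder), two lines are enough to keep Googlebot out of one area:

  User-agent: Googlebot
  Disallow: /admin/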

Source: Search Engine Journal

When you have a Disallow rule in place, the crawler will see the robots.txt file and note that you don’t want those pages used in search results. Every site needs a robots.txt file to tell user agents how to act and which robots exclusion rules to abide by. It’s a simple plain-text file that you (or, more likely, your webmaster wizard) can write up in Notepad or any text editor to effectively put restrictions on crawlers. When you use a robots.txt file, you are given the opportunity to enforce a Googlebot disallow.

Your website likely already has a robots.txt file within it. If you go to the base URL of your website (e.g. www.edgeonline.com.au) and add a simple /robots.txt to the end of it, you’ll see the robots.txt file pop up. If your site has been set up correctly, the robots.txt file will have strategic lines of directives that work in your favour for each user agent.

Inside the file, you’ll see lines of directives that could span anything from a few entries to a long list, most of which will start with “Disallow”. These are what user agents and other search engine crawlers look for in order to know which pages to avoid. A robots.txt file essentially stops the crawling of those pages in its tracks.

What does the robots.txt file do on search engines?

At its most basic, robots.txt allows you to block user agents from looking at pages, crawling the website, or accessing it altogether. The robots.txt file tells the search engine not to crawl the content on a page. Sometimes, though, a search engine will see that there are a lot of links pointing at a blocked URL. This tells user agents and bots that the page is authoritative, so it may still be indexed and show up on Google – just without a proper meta description.

If you see a result appear with no meta description available, it’s time to go into the robots.txt file and fix the problem – which should lead to a set of user agent rules that spark better rankings or resolve any crawling issues. A robots.txt file can be written to block pages on your site from a search engine: you could block one URL, one directory, or specific parts of the website.
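
A related fix worth knowing (a standard technique, though it lives in the page’s HTML rather than in robots.txt): if the goal is to keep a page out of search results entirely, a noindex meta tag is often the better tool than a Disallow rule, because Google has to be able to crawl the page in order to see the tag:

  <meta name="robots" content="noindex">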

The bigger the website, the more lines you’ll likely have in your robots.txt file, and the more complex your user agent rules become.

However, it’s important to note that the more lines your robots.txt file has, the more rules there are in play for crawlers across your entire site. This can create delays and slow crawling right down, so use it strategically and carefully. Slow speeds often cause pages to rank poorly or hinder user experience. If you’re not sure whether your pages should be excluded or included in your robots.txt file, make sure you seek expert advice (like from our tech-heads at Edge).

How bots work to index your page

One way to prevent sluggish speeds is to create a purposeful crawl delay in your robots.txt file. A crawl delay is sometimes factored into robots.txt to help take the load off the server. This is done by writing commands in the text file that tell the crawler to wait ‘X’ seconds between requests before it crawls again. For example, if your robots.txt has several Disallow lines for your pages, it might be a good idea to have the web crawler wait 30 seconds or more before it requests the next page.
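
As a sketch, the directive is a single line under the relevant user agent. One caveat we should flag: Googlebot itself ignores Crawl-delay (Google’s crawl rate is managed through Search Console instead), but crawlers such as Bingbot do honour it:

  User-agent: bingbot
  Crawl-delay: 30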

Depending on the size and structure of your pages, this is a strategy that could be put into play to help make sure the site runs smoothly and comes up easily in search. Servers and robots across search engines are quite sophisticated, and it’s not always a necessity to factor a crawl delay into your robots.txt. Instead, the server itself might account for extra activity, and change dynamically to suit what’s going on at the present moment.

As this technology advances, there’s a bigger, faster shift towards reactive web tools that are sensitive to situations and parameters – something we consider to be the future of web design. However, no matter how far forward these advancements go, your robots.txt will always be an essential part of your pages. So it’s worth knowing everything there is about it from the get-go.

IMPORTANT NOTE: Having a robots.txt file at your disposal doesn’t guarantee that crawlers and Googlebots won’t crawl your pages. Some software is specifically crafted to read your robots.txt file and crawl your site without heeding any of the specifications outlined in the file. As always in business and marketing, being strategic about what you put up within robots.txt is the best way to avoid prickly situations.

What should your robots.txt look like?

A robots.txt file is essentially something you should use when you want to speak directly to search engines and software agents, rather than humans and customers. Google recommends that all sites use a robots.txt file, and it factors robots.txt in as part of its analysis of your site. The robots.txt file itself isn’t anything fancy – just lines of text on a white page. This text instructs Googlebots (Google’s search engine “henchmen”) and other web crawlers on what to do when they go into certain areas of your site. Depending on the content of your robots.txt file, these “digital robots” will either use it to gain access to a page, or you’ll set up a disallow rule in robots.txt that bans the web crawler from accessing specific information.

To start off with, a simple robots.txt text file is created in which these parameters are set out. When you create a robots.txt file, make sure to name it correctly: robots txt, robot txt, robot.txt, or anything other than robots.txt simply won’t be found when a crawler visits your site. The user agent is the name of the crawler, and you describe the agent each rule applies to in the code of your robots.txt file. If there’s NO crawler that you’re willing to let anywhere near your info, your robots.txt uses a star (*) as the user agent, paired with a blanket disallow.
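
As a sketch of that most restrictive case, the whole file can be just two lines – the star addresses every crawler, and the lone slash covers the entire site:

  User-agent: *
  Disallow: /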

The robots.txt file explained and illustrated by Varvy

The second line of your robots.txt file is where you list the specific pages you don’t want accessed. This is also a good place to tell crawlers where to find your sitemap. Other than being a handy feature to get a crawler’s attention, mentioning your sitemap within robots.txt supports your SEO, and most importantly, helps your audience find what they’re looking for with the click of a link.
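
Pointing to the sitemap is a single line, usually at the top or bottom of the file (the URL below is a hypothetical placeholder for your own):

  Sitemap: https://www.example.com/sitemap.xml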

What does this file look like?

A typical robots.txt file will start off with a first line containing “User-agent:” followed by a crawler’s name, or a (*) to address all crawlers, as we mentioned above. This line in your robots.txt file gets the attention of the online robots out there. The subsequent lines in robots.txt tell them what they’re allowed to do on your site – or, more accurately, what they’re not allowed to do. From the second line onwards, the rules will typically start with “Disallow”. It’s not necessary to specify “Allow” rules in robots.txt – crawlers will automatically go anywhere that doesn’t have a restriction specified. It’s always a good idea to be cautious with the allow and disallow rules in any robots.txt file – the paths you’re restricting access to are case sensitive. In robots.txt, Disallow can be specified for files, folders, or extensions, or even an entire site, if the aim is private viewing only.
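
Pulling that together, here’s a sketch of a typical file (every path below is a hypothetical placeholder; the * and $ wildcards in the last line are supported by major crawlers like Googlebot, though they weren’t part of the original standard):

  User-agent: *
  Disallow: /private-page.html
  Disallow: /drafts/
  Disallow: /*.pdf$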

When all the content in the robots.txt file is defined, it’s time to upload it online. Your robots.txt file should always be uploaded to your root directory, which is essentially the HQ of your back-end functions. As long as you have access to your root directory, you can simply write your file, and upload it to the site.
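
To illustrate with a placeholder domain: crawlers only ever look for the file at the root of the host, so the first address below will be found and read, while the second will simply be ignored:

  https://www.example.com/robots.txt (found and read)
  https://www.example.com/pages/robots.txt (ignored by crawlers)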

REMINDER: Make sure the robots.txt filename is spelled entirely in lower case.

Be aware that your robots.txt file isn’t just accessible to search engines and those nifty little web crawlers – it’s accessible to ANYONE. That means you’ll want to be careful with what you put in there, as any sensitive information can potentially be a cybersecurity weakness. If, for example, you have a password file, or a folder containing private information that you aren’t prepared to have unveiled, listing it within robots.txt only advertises it – it’s best not to upload it onto your site at all. Hackers often look at the robots.txt file as a starting point when searching for a way to breach your security, so be mindful of the content you’re laying bare within a robots.txt file.

NOTE: It’s always a good idea to check up on your robots.txt file from time to time. You’ll want to make sure that you’re not blocking any new pages or directories that you want the bots to find and Google to surface in search results. In our professional opinion, robots.txt is a technical art form that requires a bit of effort for a small file – but makes a huge difference to rankings when done right. If you’re not sure how to check all of this, our digital marketing experts can do it for you.

What does robots.txt mean for your SEO?

The small, but powerful robots.txt file has a big impact on how Google gets into and sees your website. Executed correctly, with all the factors outlined above in full play, it can be smooth sailing. If there’s a block in the way that doesn’t work for you, or Google, then it can take just one line in robots.txt for your website to be taken down a notch in the rankings.

One of Google’s main tools for analysing a website’s value and performance is Google Search Console. Through Search Console, a check of your site can reveal what might be holding you back from receiving your best rankings yet. There are all kinds of errors that can pop up; from broken links to broken URLs.

One of the most common errors you might find, however, is a robots.txt glitch. What this usually means is that bots can’t crawl and index a particular portion, file, or section of the website. In turn, that part of the website won’t show up in search results. If it’s something that you want to be made visible to all, this becomes a problem, as it hinders you from growing your presence. In the case of, say, an e-commerce company that has a large number of new pages constantly being created, monitoring your robots.txt becomes an essential part of your back-end web activity to ensure that there aren’t any clashes going on.

In the context of a big e-commerce website, if a new page or category is created under a disallowed section, you might not see any traffic going to it at all. That one small line in robots.txt holds power that can influence the rise or fall of your bottom line – so it’s important to pay extra-careful attention to it with every update, and cross-check to eliminate errors.

A final note on web robots and SEO

If done correctly, robots.txt could be no trouble at all, promoting efficiency and great rankings in your digital space. Our robots.txt experts have seen all the technicalities that a robots.txt file brings to the table, and know how to play them to your advantage. Sound a bit overwhelming? Don’t worry, the SEO world is full of technical information and – well – jargon. But that’s what we love about it. And luckily, we can decipher all of it for you, helping you understand whether your site looks how it should and whether those bots are indexing pages appropriately.

Sean -

SEO Director

1300 558 659 - www.edgeonline.com.au
