There have always been issues with getting dynamic pages indexed. Search engines do a much better job with dynamic content these days, but websites still run into problems from time to time.
First of all, you should know that search engine spiders do not accept cookies. So if your site forces a user or a spider to accept cookies and provides no alternative way in, there is going to be a problem.

Next, session IDs in the URL can keep your site from showing up in the search engines at all, or leave you with duplicate content because the same page gets re-indexed under different session IDs. Chances are you have seen sites/pages in the search results with session IDs in their URLs. The trick is knowing how to work around this for the search engines, and we will look at a few of the options available to you.
If you do use session IDs on your site, one option is to assign the session ID via a cookie and make it optional. The problem generally lies in requiring a session ID before a visitor (or spider) can proceed any further through your site, so try to avoid making it a requirement.
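To make the idea concrete, here is a minimal sketch assuming a Python/Flask storefront (the article doesn't specify any particular platform): the session ID lives in a cookie that is set opportunistically, links are never rewritten to carry it, and nothing breaks if the cookie never comes back. The is_probable_crawler helper is hypothetical and purely illustrative.

from flask import Flask, request, session
import uuid

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # needed for Flask's signed session cookie

def is_probable_crawler(user_agent: str) -> bool:
    # Hypothetical, very rough heuristic -- for illustration only.
    return any(bot in user_agent.lower() for bot in ("googlebot", "bingbot", "slurp"))

@app.route("/catalog")
def catalog():
    ua = request.headers.get("User-Agent", "")
    if not is_probable_crawler(ua) and "sid" not in session:
        # Assign the session ID via a cookie for normal visitors only.
        session["sid"] = uuid.uuid4().hex
    # Links stay clean -- no session ID appended to the URL -- so the page
    # renders the same whether or not the cookie is ever sent back.
    return '<a href="/catalog/widgets">Widgets</a>'

if __name__ == "__main__":
    app.run()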
If you want to see whether your site can be indexed, try viewing it in a text browser such as Lynx. That is roughly what a search engine sees when it comes to index your site/pages. If you run into problems viewing your site this way, there is a very good chance the search engines will run into them as well.
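If you would rather script the check than install Lynx, here is a rough Python sketch of the same idea (my own approach, not something the search engines publish): fetch a page with no cookie handling and no JavaScript, then dump just the text and the links. The URL is a placeholder.

from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextAndLinkDumper(HTMLParser):
    # Prints text nodes and link targets, skipping <script> and <style> content.
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                print(f"[link] {href}")
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            print(text)

# No cookie handling is attached to this request, so it behaves like a client
# that refuses cookies -- if your pages only work once a session cookie is set,
# the failure should show up here too.
req = Request("http://www.example.com/", headers={"User-Agent": "plain-text-check"})
with urlopen(req, timeout=10) as resp:
    charset = resp.headers.get_content_charset() or "utf-8"
    html = resp.read().decode(charset, "replace")

TextAndLinkDumper().feed(html)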
Blocking Session IDs with the Robots.txt File
You can use pattern matching, on its own or in combination with the Allow directive, in your robots.txt file.
For instance, if osCsid indicates a session ID, you may want to exclude all URLs that contain it so the search engines don’t crawl duplicate pages:
User-agent: *
Disallow: /*osCsid
Now take a question mark as the marker of a session ID. URLs that end with a ? may be the version of the page that you do want included, while Disallow: /*? will keep the duplicate pages/content from getting indexed. Set up your robots.txt file like this:
User-agent: *
Allow: /*?$
Disallow: /*?
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
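If you want to sanity-check which of your URLs these two directives affect before relying on them, here is a small Python sketch that approximates Googlebot-style wildcard matching (the longest matching rule wins, and Allow wins a tie). It is an approximation for testing, not the crawler's actual code.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(url_path: str, allows=("/*?$",), disallows=("/*?",)) -> bool:
    # Longest matching pattern wins; on a tie, Allow wins because it is checked first.
    best_len, allowed = -1, True
    rules = [(p, True) for p in allows] + [(p, False) for p in disallows]
    for pattern, verdict in rules:
        if robots_pattern_to_regex(pattern).match(url_path) and len(pattern) > best_len:
            best_len, allowed = len(pattern), verdict
    return allowed

print(is_allowed("/products?"))               # True  -- ends in '?', still crawlable
print(is_allowed("/products?osCsid=abc123"))  # False -- '?' followed by a session ID, blocked
print(is_allowed("/products"))                # True  -- no '?' at all, untouched by either rule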