Splunk: How to work with multiple indexes [Tutorial]

An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb.

In this tutorial, we focus on index structures, when you need multiple indexes, how to size an index, and how to manage multiple indexes in a Splunk environment.

This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 – Third Edition.

Directory structure of an index

Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk.

Look at the following stanza for the main index:

[main] 
homePath = $SPLUNK_DB/defaultdb/db 
coldPath = $SPLUNK_DB/defaultdb/colddb 
thawedPath = $SPLUNK_DB/defaultdb/thaweddb 
maxHotIdleSecs = 86400 
maxHotBuckets = 10 
maxDataSize = auto_high_volume

If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb.

To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf.

splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf.
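
For example, assuming a hypothetical mount point of /bigdisk/splunk, the main stanza could be rewritten with absolute paths like this:

[main] 
homePath = /bigdisk/splunk/defaultdb/db 
coldPath = /bigdisk/splunk/defaultdb/colddb 
thawedPath = /bigdisk/splunk/defaultdb/thaweddb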

The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details.

When to create more indexes

There are several reasons for creating additional indexes. If your situation does not match one of the requirements below, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open several of them.

Testing data

If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible.
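
For instance, while validating a new configuration you might point the input at a scratch index (here called test, an arbitrary name) and drop that index's contents once you are satisfied. On a stopped test instance, splunk clean eventdata -index test removes everything in that index without touching the rest of your data.

[monitor:///path/to/new/application.log] 
sourcetype = new_app 
index = test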

Differing longevity

It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks.

If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps:

  1. Create a new index called security, for instance
  2. Define different settings for the security index
  3. Update inputs.conf to use the new index for security source types

For one year of retention, you might use an indexes.conf stanza such as this:

[security] 
homePath = $SPLUNK_DB/security/db 
coldPath = $SPLUNK_DB/security/colddb 
thawedPath = $SPLUNK_DB/security/thaweddb 
#one year in seconds 
frozenTimePeriodInSecs = 31536000

For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir.
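
As a sketch, the security stanza might then grow to something like the following; the 512000 MB cap and the archive path are illustrative assumptions, not recommendations:

[security] 
#...settings shown above, plus: 
#cap the index at roughly 500 GB 
maxTotalDataSizeMB = 512000 
#archive frozen buckets instead of deleting them 
coldToFrozenDir = /archive/splunk/security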

If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information.

Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows:

[monitor:///path/to/security/logs/logins.log] 
sourcetype=logins 
index=security

Differing permissions

If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows:

  1. Define the new index.
  2. Configure inputs.conf or transforms.conf to send these events to the new index.
  3. Ensure that the user role does not have access to the new index.
  4. Create a new role that has access to the new index.
  5. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group.
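
For steps 4 and 5, a minimal sketch in authorize.conf might look like the following; the role name is hypothetical, and srchIndexesAllowed controls which indexes a role is allowed to search:

[role_sensitive_readers] 
#inherit the capabilities of the built-in user role 
importRoles = user 
#allow this role to search (and default to) the new index 
srchIndexesAllowed = sensitive 
srchIndexesDefault = sensitive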

To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows:

[contains_password] 
REGEX = (?i)password[=:] 
DEST_KEY = _MetaData:Index 
FORMAT = sensitive

You would then wire this transform to a particular source type or source in props.conf.
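
For example, if the events in question arrive with a source type called app_logs (a hypothetical name), the wiring in props.conf would look like this:

[app_logs] 
TRANSFORMS-route_sensitive = contains_password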

Using more indexes to increase performance

Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question.

If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance.
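
Once the source types live in separate indexes, a search that only needs one of them can name that index and never touch the other indexes' buckets. Assuming the web access logs were placed in an index called web:

index=web sourcetype=web_access code=500 
| top url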

However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let’s imagine you have two source types called web_access and web_error.

We have the following line in web_access:

2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app

And we have the following line in web_error:

2012-10-19 12:53:20 session=abcdefg class=LoginClass

If we want to combine these results, we could run a query like the following:

(sourcetype=web_access code=500) OR sourcetype=web_error 
| transaction maxspan=2s session 
| top url class

If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long.
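
If you want a feel for how many buckets such a search has to consider, the dbinspect command lists the buckets in an index; web here is just a hypothetical index name:

| dbinspect index=web 
| stats count by state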

The life cycle of a bucket

An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time.

The stages of this life cycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf:

  • homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket.
  • Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe.
  • coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required.
  • Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following:
    • coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage.
    • coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen.
  • thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all time searches. To search these buckets, their time range must be included explicitly in your search.

I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures.
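
Returning to the frozen stage for a moment, here is a minimal sketch of the two options in indexes.conf; the archive path and the script name are assumptions for illustration:

[myindex] 
#keep only the compressed raw data in this unmanaged location 
coldToFrozenDir = /archive/splunk/myindex 
#or, instead, run a custom script that is handed the path of the bucket being frozen 
#coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/archive_bucket.py"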

Sizing an index

To estimate how much disk space is needed for an index, use the following formula:

(gigabytes per day) * .5 * (days of retention desired)

Likewise, to determine how many days you can store an index, the formula is essentially:

(device size in gigabytes) / ( (gigabytes per day) * .5 )

The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search bring the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this.
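
As a quick worked example, an index receiving 20 GB per day that must be kept for 90 days needs roughly 20 * .5 * 90 = 900 GB of disk, and a dedicated 2,000 GB device would hold roughly 2000 / (20 * .5) = 200 days of that data.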

If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows:

homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize 
coldPath = maxTotalDataSizeMB - homePath

For example, say we are given these settings:

[myindex] 
homePath = /splunkdata_home/myindex/db 
coldPath = /splunkdata_cold/myindex/colddb 
thawedPath = /splunkdata_cold/myindex/thaweddb 
maxWarmDBCount = 50 
maxHotBuckets = 6 
maxDataSize = auto_high_volume #10GB on 64-bit systems 
maxTotalDataSizeMB = 2000000

Filling in the preceding formula, we get these values:

homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB 
coldPath = 2000000 MB - homePath = 1426560 MB

If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math.

Using volumes to manage multiple indexes

Volumes combine pools of storage across different indexes so that they age out together. Let’s make up a scenario where we have five indexes and three storage devices.

The indexes are as follows:

Name          Data per day   Retention required   Storage needed
web           50 GB          no requirement       ?
security      1 GB           2 years              730 GB * 50 percent
app           10 GB          no requirement       ?
chat          2 GB           2 years              1,460 GB * 50 percent
web_summary   1 GB           1 year               365 GB * 50 percent

Now let’s say we have three storage devices to work with, mentioned in the following table:

Name        Size
small_fast  500 GB
big_fast    1,000 GB
big_slow    5,000 GB

We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes. We want our hot buckets on our fast devices, so let’s start there with the following configuration:

[volume:two_year_home] 
#security and chat home storage 
path = /small_fast/two_year_home 
maxVolumeDataSizeMB = 300000 
[volume:one_year_home] 
#web_summary home storage 
path = /small_fast/one_year_home 
maxVolumeDataSizeMB = 150000

For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows:

[volume:two_year_cold] 
#security and chat cold storage 
path = /big_slow/two_year_cold 
maxVolumeDataSizeMB = 850000 #([security]+[chat])*1024 - 300000 
[volume:one_year_cold] 
#web_summary cold storage 
path = /big_slow/one_year_cold 
maxVolumeDataSizeMB = 230000 #[web_summary]*1024 - 150000

Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so:

[volume:large_home] 
#web and app home storage 
path = /big_fast/large_home 
maxVolumeDataSizeMB = 900000 #leaving 10% for pad 
[volume:large_cold] 
#web and app cold storage 
path = /big_slow/large_cold 
maxVolumeDataSizeMB = 3700000 
#(big_slow - two_year_cold - one_year_cold)*.9

Given that the sum of large_home and large_cold is 4,600,000 MB, and that the combined daily volume of web and app is approximately 60,000 MB, we should retain approximately 153 days of web and app logs with 50 percent compression (4,600,000 / (60,000 * .5) ≈ 153 days).

In reality, the number of days retained will probably be larger. With our volumes defined, we now have to reference them in our index definitions:

[web] 
homePath = volume:large_home/web 
coldPath = volume:large_cold/web 
thawedPath = /big_slow/thawed/web 
[security] 
homePath = volume:two_year_home/security 
coldPath = volume:two_year_cold/security 
thawedPath = /big_slow/thawed/security 
coldToFrozenDir = /big_slow/frozen/security 
[app] 
homePath = volume:large_home/app 
coldPath = volume:large_cold/app 
thawedPath = /big_slow/thawed/app 
[chat] 
homePath = volume:two_year_home/chat 
coldPath = volume:two_year_cold/chat 
thawedPath = /big_slow/thawed/chat 
coldToFrozenDir = /big_slow/frozen/chat 
[web_summary] 
homePath = volume:one_year_home/web_summary 
coldPath = volume:one_year_cold/web_summary 
thawedPath = /big_slow/thawed/web_summary

thawedPath cannot be defined using a volume and must be specified for Splunk to start.

For extra protection, we specified coldToFrozenDir for the security and chat indexes. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available.

This is just one approach to using volumes. You could overlap volumes in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter which index placed the bucket in that volume.

With this, we learned to operate multiple indexes and how we can get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 – Third Edition and start creating advanced Splunk dashboards.
