
Puppet is an extensible automation framework, a tool, and a language. We can do great things with it, and we can do them in many different ways. Besides the technicalities of learning the basics of its DSL, one of the biggest challenges for new and not-so-new users of Puppet is to organize code and put things together in a manageable and appropriate way.

It's hard to find comprehensive documentation on how to use public code (modules) with our custom modules and data, where to place our logic, how to maintain and scale it, and, generally, how to safely and effectively manage the resources that we want on our nodes and the data that defines them.

There's not really a single answer that fits all these cases. There are best practices, recommendations, and many debates in the community, but ultimately, it all depends on our own needs and infrastructure, which vary according to multiple factors, such as the following:

The number and variety of nodes and application stacks to manage
The infrastructure design and number of data centers or separate networks to manage
The number and skills of people who work with Puppet
The number of teams who work with Puppet
Puppet's presence and integration with other tools
Policies for change in production

In this article, we will outline the elements needed to design a Puppet architecture, reviewing the following elements in particular:

The tasks to deal with (manage nodes, data, code, files, and so on) and the available components to manage them
The Foreman, which, together with the Puppet Enterprise Console, is probably the most widely used ENC around
The pattern of roles and profiles
Data separation challenges and issues
How the various components can be used together in different ways with some sample setups

The components of Puppet architecture

With Puppet, we manage our systems via the catalog that the Puppet Master compiles for each node. This is the total of the resources we have declared in our code, based on the parameters and variables whose values reflect our logic and needs.

Most of the time, we also provide configuration files either as static files or via ERB templates, populated according to the variables we have set.

We can identify the following major tasks when we have to manage what we want to configure on our nodes:

Definition of the classes to be included in each node
Definition of the parameters to use for each node
Definition of the configuration files provided to the nodes

These tasks can be provided by different, partly interchangeable components, which are as follows:

site.pp is the first file parsed by the Puppet Master (by default, its path is /etc/puppet/manifests/site.pp), together with any files imported from there (for example, import nodes/*.pp imports and parses all the files with the .pp suffix in the /etc/puppet/manifests/nodes/ directory). Here, we have code in the Puppet language.
An ENC (External Node Classifier) is an alternative source that can be used to define the classes and parameters to apply to nodes. It's enabled with the following lines in the Puppet Master's puppet.conf:
```
[master]
  node_terminus = exec
  external_nodes = /etc/puppet/node.rb
```
What's referenced by the external_nodes parameter can be any script that uses any backend; it's invoked with the client's certname as the first argument (/etc/puppet/node.rb web01.example.com) and should return YAML-formatted output that defines the classes to include for that node, the parameters, and the Puppet environment to use.

Besides the well-known Puppet-specific ENCs such as The Foreman and Puppet Dashboard (a former Puppet Labs project now maintained by community members), it's not uncommon to write new custom ones that leverage existing tools and infrastructure-management solutions.
LDAP can be used to store nodes' information (classes, environment, and variables) as an alternative to using an ENC. To enable LDAP integration, add the following lines to the Master's puppet.conf:
```
[master]
  node_terminus = ldap
  ldapserver = ldap.example.com
  ldapbase = ou=Hosts,dc=example,dc=com
```
Then, we have to add Puppet's schema to our LDAP server. For more information and details, refer to http://docs.puppetlabs.com/guides/ldap_nodes.html.
Hiera is the hierarchical key-value datastore. It is embedded in Puppet 3 and available as an add-on for previous versions. Here, we can set parameters but also include classes and, if needed, provide content for files.
Public modules can be retrieved from Puppet Forge, GitHub, or other sources; they typically manage applications and system settings. Being public, they might not fit all our custom needs, but they are supposed to be reusable, support different OSes, and adapt to different use cases. We are supposed to be able to use them without any modification, as if they were public libraries, committing our fixes and enhancements back to the upstream repository. A common but less-recommended alternative is to fork a public module and adapt it to our needs. This might seem a quicker solution, but it definitely doesn't help the open source ecosystem and would prevent us from benefiting from updates to the original repository.
Site module(s) are custom modules with local resources and files where we can place all the logic we need or the resources we can't manage with public modules. There may be one or more of them, and they may be called site or have the name of our company, customer, or project. Site modules make particular sense as a companion to public modules when the latter are used without local modifications. In site modules, we can place local settings, files, custom logic, and resources.

The distinction between public reusable modules and site modules is purely formal; they are both Puppet modules with a standard structure. It might make sense to place the ones we develop internally in a dedicated directory (a separate entry in the modulepath), different from the one where we place shared modules downloaded from public sources.

Let's see how these components might fit our Puppet tasks.

Defining the classes to include in each node

This is what we typically mean when we talk about node classification in Puppet. It is the task that the Puppet Master accomplishes when it receives a request from a client node and has to determine the classes and parameters to use for that specific node.

Node classification can be done in the following different ways:

We can use the node declaration in site.pp and in any other manifests imported from there. In this way, we identify each node by certname and declare all the resources and classes we want for it, as shown in the following code:
```
node 'web01.example.com' {
  include ::general
  include ::apache
}
```
Here, we may even decide to follow a nodeless layout, where we don't use the node declaration at all and rely on facts to manage the classes and parameters to be assigned to our nodes. An example of this approach is examined later in this article.
On an ENC, we can define the classes (and parameters) that each node should have. The returned YAML for our simple case would be something like the following lines of code:
```
---
classes:
  - general
  - apache
parameters:
  dns_servers:
    - 8.8.8.8
    - 8.8.4.4
  smtp_server: smtp.example.com
environment: production
```
Via LDAP, we can have a hierarchical structure in which a node can inherit the classes (referenced with the puppetClass attribute) set in a parent node (parentNode).
Via Hiera, we can use the hiera_include function; we just add the following line to site.pp:
```
hiera_include('classes')
```
Then, in our hierarchy, we define under the key named classes what to include for each node. For example, with a YAML backend, our case would be represented with the following lines of code:
```
---
classes:
  - general
  - apache
```
In site module(s), we can place any custom logic, for example, the classes and resources to include for all the nodes or for specific groups of nodes.
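
The nodeless layout mentioned above can be sketched as follows. This is only a minimal illustration, assuming that every node exposes a custom fact named role (a hypothetical name, not one of Puppet's built-in facts) and that a matching role::<name> class exists in a site module:

```puppet
# /etc/puppet/manifests/site.pp
# Nodeless sketch: no node blocks, classification is driven by facts.
# ASSUMPTION: 'role' is a custom fact and a role::<name> class exists
# for each of its possible values.

# Baseline applied to every node ('general' is the class used in the
# earlier examples)
include ::general

if $::role {
  # For example, a node with role=webserver gets the role::webserver class
  include "role::${::role}"
} else {
  # Make unclassified nodes visible in the reports
  notify { "No role fact set on ${::fqdn}": }
}
```

With a layout like this, adding a node requires no change to site.pp; the node only needs to report the right fact.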

Defining the parameters to use for each node

This is another crucial part, as with parameters, we can characterize our nodes and define the resources we want for them.

Generally, to identify and characterize a node in order to differentiate it from the others and provide the specific resources we want for it, we need very few key parameters, such as the following (the names used here may be common but are arbitrary and are not Puppet's internal ones):

role is almost a de facto standard name to identify the kind of server. A node is supposed to have just one role, which might be something like webserver, app_be, db, or anything that identifies the function of the node. Note that web servers that serve different web applications should have different roles (for example, webserver_site, webserver_blog, and so on). We can have one or more nodes with the same role.
env or any name that identifies the operational environment of the node (if it is a development, test, qa, or production server).

Note that this doesn't necessarily match Puppet's internal environment variable. Some people prefer to merge the env information into role, having roles such as webserver_prod and webserver_devel.
Zone, site, data center, country, or any parameter that might identify the network, country, availability zone, or data center where the node is placed. A node is supposed to belong to only one of these. We might not require this in our infrastructure.
Tenant, component, application, project, and cluster might be the other kind of variables that characterize our node. There's not a real standard on their naming, and their usage and necessity strictly depend on the underlying infrastructure.

With parameters such as these, any node can be fully identified and be served with any specific configuration. It makes sense to provide them, where possible, as facts.

The parameters we use in our manifests may have a different nature:

role/env/zone as defined earlier are used to identify the nodes; they typically are used to determine the values of other parameters
OS-related parameters such as package names and file paths
Parameters that define the services of our infrastructure (DNS servers, NTP servers, and so on)
Usernames and passwords, which should be kept confidential, used to manage credentials
Parameters that express any further custom logic and classifying need (master, slave, host_number, and so on)
Parameters exposed by the parameterized classes or defines we use

Often, the value of some parameters depends on the value of other ones. For example, the DNS or NTP server may change according to the zone or region of a node. When we start to design our Puppet architecture, it's important to have a general idea of the variations involved and the possible exceptions, as we will probably define our logic according to them. As a general rule, we will use the identifying parameters (role/env/zone) to define most of the other parameters, so we'll probably need to use them in our Hiera hierarchy or in Puppet selectors. This also means that we will probably need to set them as top scope variables (for example, via an ENC) or facts.

As with the classes that have to be included, parameters may be set by various components; some of them are actually the same, as in Puppet, a node's classification involves both classes to include and parameters to apply. These components are:

In site.pp, we can set variables. If they are outside nodes' definitions, they are at top scope; if they are inside, they are at node scope. Top scope variables should be referenced with a :: prefix, for example, $::role. Node scope variables are available inside the node's classes with their plain name, for example, $role.
An ENC returns parameters, treated as top scope variables, alongside classes, and the logic of how they can be set depends entirely on its structure. Popular ENCs such as The Foreman, Puppet Dashboard, and the Puppet Enterprise Console allow users to set variables for single nodes or for groups, often in a hierarchical fashion. The kind and amount of parameters set here depend on how much information we want to manage on the ENC and how much to manage somewhere else.
LDAP, when used as a node's classifier, returns variables for each node as defined with the puppetVar attribute. They are all set at top scope.
In Hiera, we set keys that we can map to Puppet variables with the hiera(), hiera_array(), and hiera_hash() functions inside our Puppet code. Puppet 3's data bindings automatically map class parameters to Hiera keys, so for these cases, we don't have to explicitly use the hiera* functions. The defined hierarchy determines how the keys' values change according to the values of other variables. In Hiera, ideally, we should place variables related to our infrastructure and credentials but not OS-related variables (they should stay in modules if we want them to be reusable).

A lot of documentation about Hiera shows sample hierarchies with facts such as osfamily and operatingsystem. In my very personal opinion, such variables should not be there (they only add weight to the hierarchy), as OS differences should be managed in the classes and modules used and not in Hiera.
In public shared modules, we typically deal with OS-specific parameters. Modules should be considered reusable components that know all about how to manage an application on different OSes but nothing about custom logic. They should expose parameters and defines that allow users to determine their behavior and fit their own needs.

On site module(s), we may place infrastructural parameters, credentials, and any custom logic, more or less based on other variables.
Finally, it's possible and generally recommended to create custom facts that identify the node directly from the agent. An example of this approach is a totally facts-driven infrastructure, where all the node-identifying variables, upon which all the other parameters are defined, are set as facts.
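
The difference between top scope and node scope variables set in site.pp can be illustrated with a short sketch. The values used here (zone it, role webserver) are just sample data, and the variable names are the arbitrary ones used in this article:

```puppet
# /etc/puppet/manifests/site.pp

# Top scope: set outside any node definition.
# Available everywhere, and best referenced as $::zone
$zone = 'it'

node 'web01.example.com' {
  # Node scope: set inside the node definition.
  # Available as $role inside the classes included by this node
  $role = 'webserver'

  include ::general
}
```

If the same values were provided by an ENC or as facts instead, both would be available at top scope, as $::zone and $::role.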

Defining the configuration files provided to the nodes

It's almost certain that we will need to manage configuration files with Puppet and that we need to store them somewhere, either as plain static files to serve via Puppet's fileserver functionality using the source argument of the File type or via .erb templates.

While it's possible to configure custom fileserver shares for static files and absolute paths for templates, it's definitely recommended to rely on the modules' autoloading conventions and place such files inside custom or public modules, unless we decide to use Hiera for them.

Configuration files, therefore, are typically placed in:

Public modules: These may provide default templates that use variables exposed as parameters by the modules' classes and defines. As users, we don't directly manage the module's template but the variables used inside it. A good and reusable module should allow us to override the default template with a custom one. In this case, our custom template should be placed in a site module. If we've forked a public shared module and maintain a custom version, we might be tempted to place all our custom files and templates there. By doing so, we lose reusability and gain, perhaps, some short-term simplicity.
Site module(s): These are a more appropriate place for custom files and templates if we want to maintain a setup based on unforked public shared modules plus custom site modules, where all our own material stays confined in one or a few modules. This allows us to recreate similar setups just by copying and modifying our site modules, as all our logic, files, and resources are concentrated there.
Hiera: Thanks to the smart hiera-file backend, Hiera can be an interesting alternative place to store configuration files, both static ones and templates. We can benefit from the hierarchy logic that works for us and can manage any kind of file without touching modules.
Custom fileserver mounts can be used to serve any kind of static files from any directory of the Puppet Master. They can be useful if we need to provide via Puppet files generated/managed by third-party scripts or tools. An entry in /etc/puppet/fileserver.conf like:
```
[data]
path /etc/puppet/static_files
allow *.example.com
```
allows us to serve a file like /etc/puppet/static_files/generated/file.txt with the argument:
```
source => 'puppet:///data/generated/file.txt',
```
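
For comparison, the more common module-based options (a static file served via source and a file rendered from an ERB template) might look like the following sketch. It assumes a site module called site that ships the two files named in the comments; the module, class, and file names are all hypothetical:

```puppet
# modules/site/manifests/motd.pp
# ASSUMPTION: a site module called 'site' that ships
#   modules/site/files/motd          (static file)
#   modules/site/templates/issue.erb (ERB template)
class site::motd {

  # Static file served by the Master from the module's files/ directory
  file { '/etc/motd':
    ensure => file,
    owner  => 'root',
    group  => 'root',
    mode   => '0644',
    source => 'puppet:///modules/site/motd',
  }

  # Content rendered on the Master from the module's templates/ directory
  file { '/etc/issue':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0644',
    content => template('site/issue.erb'),
  }
}
```

Note that the files/ and templates/ directory names are implied by the autoloading conventions and do not appear in the source URL or in the template() argument.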

Defining custom resources and classes

We'll probably need to provide custom resources, which are not declared in the shared modules, to our nodes, because these resources are too specific. We'll probably want to create some grouping classes, for example, to manage the common baseline of resources and classes we want applied to all our nodes.

This is typically a bunch of custom code and logic that we have to place somewhere. The usual locations are as follows:

Shared modules: These are forked and modified to include custom resources; as already outlined, this approach doesn't pay off in the long term.
Site module(s): These are the preferred place for custom code, including the classes where we can manage common baselines, role classes, and other container classes.
Hiera, partially, if we are fond of the create_resources function fed by hashes provided in Hiera. In this case, somewhere (in a site or shared module or maybe, even in site.pp), we have to place the create_resources statements.
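
The Hiera-driven approach in the last point can be sketched as follows. This is a minimal example, assuming a Hiera key named site_users (a hypothetical name) that holds a hash of user resources, placed in a class of a site module:

```puppet
# modules/site/manifests/users.pp
# ASSUMPTION: a Hiera key named 'site_users' with data such as (YAML):
#   site_users:
#     alice:
#       uid: 1001
#     bob:
#       uid: 1002
class site::users {

  # Merge the hash across all the hierarchy levels;
  # fall back to an empty hash if the key is not defined
  $users = hiera_hash('site_users', {})

  # Defaults applied to every user unless overridden in the data
  $defaults = {
    'ensure'     => 'present',
    'managehome' => true,
  }

  # Declare one user resource for each key of the hash
  create_resources('user', $users, $defaults)
}
```

With this layout, adding or removing a user means editing only the Hiera data, while the code stays unchanged.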

The Foreman

The Foreman is definitely the biggest open source software product related to Puppet that is not directly developed by Puppet Labs.

The project was started by Ohad Levy, who now works at Red Hat and leads its development, supported by a great team of internal employees and community members.

The Foreman can work as a Puppet ENC and reporting tool; it presents an alternative to the Inventory System, and most of all, it can manage the whole lifecycle of the system, from provisioning to configuration and decommissioning.

Some of its features have been quite ahead of their time. For example, the foreman() function has long made possible what is now done with the puppetdbquery module.

It allows direct querying of all the data gathered by The Foreman: facts, node classification, and Puppet-run reports.

Let's look at this example that assigns to the $web_servers variable the list of hosts that belong to the web hostgroup, which have reported successfully in the last hour:

```
$web_servers = foreman('hosts',
  'hostgroup ~ web and status.failed = 0 and last_report < "1 hour ago"')
```

This was possible long before PuppetDB was even conceived.

The Foreman really deserves at least a book by itself, so here, we will just summarize its features and explore how it can fit in a Puppet architecture.

We can decide which components to use:

Systems provisioning and life-cycle management
Nodes IP addressing and naming
The Puppet ENC function based on a complete web interface
Management of client certificates on the Puppet Master
The Puppet reporting function with a powerful query interface
The Facts querying function, equivalent to the Puppet Inventory system

For some of these features, we may need to install Foreman's Smart Proxies on some infrastructural servers. The proxies are registered on the central Foreman server and provide a way to remotely control relevant services (DHCP, PXE, DNS, Puppet Master, and so on).

The Web GUI based on Rails is quite complete and appealing, but it might prove cumbersome when we have to deal with a large number of nodes. For this reason, we can also manage Foreman via the CLI.

The original foreman-cli command has been around for years but is now deprecated in favor of the new hammer (https://github.com/theforeman/hammer-cli) with the Foreman plugin, which is very versatile and powerful, as it allows us to manage, via the command line, most of what we can do on the web interface.

Roles and profiles

In 2012, Craig Dunn wrote a blog post (http://www.craigdunn.org/2012/05/239/) that quickly became a point of reference on how to organize Puppet code. He discussed his concept of roles and profiles. The role describes what the server represents: a live web server, a development web server, a mail server, and so on. Each node can have one and only one role. Note that in his post, he manages environments inside roles (two web servers in two different environments have two different roles):

```
node www1 {
  include ::role::www::dev
}
node www2 {
  include ::role::www::live
}
node smtp1 {
  include ::role::mailserver
}
```

Then, he introduces the concept of profiles, which include and manage modules to define a logical technical stack. A role can include one or more profiles:

```
class role {
  include ::profile::base
}
class role::www inherits role {
  include ::profile::tomcat
}
```

In environment-related subroles, we can manage the exceptions we need (here, for example, the www::dev role includes both the database and webserver::dev profiles):

```
class role::www::dev inherits role::www {
  include ::profile::webserver::dev
  include ::profile::database
}
class role::www::live inherits role::www {
  include ::profile::webserver::live
}
```

Usage of class inheritance here is not mandatory, but it is useful to minimize code duplication.

This model expects modules to be the only components where resources are actually defined and managed; they are supposed to be reusable (we use them without modifying them) and manage only the components they are written for.

In profiles, we can manage resources and the ordering of classes; we can initialize variables and use them as values for arguments in the declared classes, and we can generally benefit from having an extra layer of abstraction:

```
class profile::base {
  include ::networking
  include ::users
}
class profile::tomcat {
  class { '::jdk': }
  class { '::tomcat': }
}
class profile::webserver {
  class { '::httpd': }
  class { '::php': }
  class { '::memcache': }
}
```
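
As a hedged sketch of the variable handling and class ordering described above, a variation of the profile::tomcat class might look like the following. The version parameter of the jdk class and the Hiera key name are assumptions made only for illustration; they are not taken from Craig Dunn's post or from any specific module:

```puppet
# ASSUMPTIONS: the jdk class exposes a 'version' parameter and the
# 'profile::tomcat::jdk_version' Hiera key may exist; both are hypothetical.
class profile::tomcat {

  # Initialize a variable in the profile (with a fallback value)...
  $jdk_version = hiera('profile::tomcat::jdk_version', '7')

  # ...and use it as the value of an argument of a declared class
  class { '::jdk':
    version => $jdk_version,
  }

  # The ordering between classes is managed in the profile,
  # not in the component modules
  class { '::tomcat':
    require => Class['jdk'],
  }
}
```

Keeping lookups and ordering in the profile leaves the jdk and tomcat modules free of any site-specific knowledge.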

In profiles subclasses, we can manage exceptions or particular cases:

```
class profile::webserver::dev inherits profile::webserver {
  Class['::php'] {
    loglevel => 'debug',
  }
}
```

This model is quite flexible and has gained a lot of attention and endorsement from Puppet Labs. It's not the only approach that we can follow to organize the resources we need for our nodes in a sane way, but it's the current best practice and a good point of reference, as it formalizes the concept of role and exposes how we can organize and add layers of abstraction between our nodes and the used modules.

The data and the code

Hiera's crusade, and possibly its main reason to exist, is data separation. In practical terms, this means converting Puppet code like the following:

```
$dns_server = $zone ? {
  'it'    => '1.2.3.4',
  default => '8.8.8.8',
}
class { '::resolver':
  server => $dns_server,
}
```

into something with no trace of local settings, like:

```
$dns_server = hiera('dns_server')
class { '::resolver':
  server => $dns_server,
}
```

With Puppet 3, the preceding code can be simplified even further to just the following line:

```
include ::resolver
```

This expects the resolver::server key to be defined, and resolved as needed, in our Hiera data sources.
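
For data bindings to work this way, the resolver class has to expose server as a class parameter. A minimal sketch of such a class is shown here; its body is hypothetical and much simpler than what a real module would contain:

```puppet
# modules/resolver/manifests/init.pp
# ASSUMPTION: a minimal, hypothetical resolver class.
#
# With Puppet 3's data bindings, the $server parameter is looked up
# automatically in Hiera under the key 'resolver::server'. The default
# value is used only if no level of the hierarchy provides that key.
class resolver (
  $server = '8.8.8.8'
) {

  file { '/etc/resolv.conf':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0644',
    content => "nameserver ${server}\n",
  }
}
```

A value passed explicitly with the class { '::resolver': server => ... } syntax would still take precedence over the Hiera lookup.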

The advantages of having data (in this case, the IP of the DNS server, whatever the logic used to derive it) in a separate place are clear:

We can manage and modify data without changing our code
Different people can work on data and code
Hiera's pluggable backend system dramatically enhances how and where data can be managed, allowing seamless integration with third-party tools and data sources
Code layout is simpler and more error proof
The lookup hierarchy is configurable

Nevertheless, there are a few small drawbacks, or maybe just necessary side effects or evolutionary steps. They are as follows:

What we've learned about Puppet and used to do without Hiera is obsolete
We don't see, directly in our code, the values we are using
We have two different places where we can look to understand what code does
We need to set the variables we use in our hierarchy as top scope variables or facts, or anyway, we need to refer to them with a fixed fully qualified name
We might have to refactor a lot of existing code to move our data and logic into Hiera

A personal note: I've been quite a late jumper on the Hiera wagon. While developing modules with the ambition that they can be reusable, I decided I couldn't exclude users who weren't using this additional component. So, until Puppet 3 with Hiera integrated in it became mainstream, I didn't want to force the usage of Hiera in my code.

Now things are different. Puppet 3's data bindings change the whole scene. Hiera is deeply integrated and is here to stay, so, even if we can happily live without it, I would definitely recommend using it in most cases.