6. Design of Experiment Generator¶

Design of Experiment (DOE) is an important activity for any scientist, engineer, or statistician planning to conduct experimental analysis. This exercise has become critical in this age of rapidly expanding field of data science and associated statistical modeling and machine learning. A well-planned DOE can give a researcher meaningful data set to act upon with optimal number of experiments preserving critical resources.

After all, aim of Data Science is essentially to conduct highest quality scientific investigation and modeling with real world data. And to do good science with data, one needs to collect it through carefully thought-out experiment to cover all corner cases and reduce any possible bias.

6.1. Quick start¶

Let’s say you have a design problem with the following table for the parameters range. Imagine this as a generic example of a chemical process in a manufacturing plant. You have 3 levels of Pressure, 3 levels of Temperature, 2 levels of FlowRate, and 2 levels of Time.

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	320	0.3	8
70	350

First, import build module from the package,

from doepy import build

Then, try a simple example by building a full factorial design. We will use FULL FACTORIAL function for this. You have to pass a dictionary object to the function which encodes your experimental data.

FULL FACTORIAL(
{'Pressure':[40,55,70],
'Temperature':[290, 320, 350],
'Flow rate':[0.2,0.4],
'Time':[5,8]}
)

If you build a full-factorial DOE out of this, you should get a table with 3x3x2x2 = 36 entries.

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	290	0.2	5
70	290	0.2	5
40	320	0.2	5
…	…	…	…
…	…	…	…
40	290	0.3	8
50	290	0.3	8
70	290	0.3	8
40	320	0.3	8
…	…	…	…
…	…	…	…

There are, of course, half-factorial designs to try!

6.2. Latin Hypercube design¶

Sometimes, a set of randomized design points within a given range could be attractive for the experimenter to asses the impact of the process variables on the output. Monte Carlo simulations are close example of this approach.

However, a Latin Hypercube design is better choice for experimental design rather than building a complete random matrix as it tries to subdivide the sample space in smaller cells and choose only one element out of each subcell. This way, a more uniform spreading’ of the random sample points can be obtained.

User can choose the density of sample points. For example, if we choose to generate a Latin Hypercube of 12 experiments from the same input files, that could look like,

SPACE FILLING DOE DESIGN(
      {'Pressure':[40,55,70],
      'Temperature':[290, 320, 350],
      'Flow rate':[0.2,0.4],
      'Time':[5,11]},
      num_samples = 12
      )

Pressure	Temperature	FlowRate	Time
63.16	313.32	0.37	10.52
61.16	343.88	0.23	5.04
57.83	327.46	0.35	9.47
68.61	309.81	0.35	8.39
66.01	301.29	0.22	6.34
45.76	347.97	0.27	6.94
40.48	320.72	0.29	9.68
51.46	293.35	0.20	7.11
43.63	334.92	0.30	7.66
47.87	339.68	0.26	8.59
55.28	317.68	0.39	5.61
53.99	297.07	0.32	10.43

Of course, there is no guarantee that you will get the same matrix if you run this function because this are randomly sampled, but you get the idea!

6.3. Designs available¶

Try any one of the following designs:

Full factorial ,
2-level fractional factorial ,
Plackett-Burman ,
Sukharev grid, Optimal strategies of the search for an extremum, A. G. Sukharev (1971).
Box-Behnken ,
Box-Wilson (Central-composite) with center-faced option ,
Box-Wilson (Central-composite) with center-inscribed option ,
Box-Wilson (Central-composite) with center-circumscribed option ,
Latin hypercube (simple) ,
Latin hypercube (space-filling) ,
Random k-means cluster ,
Maximin reconstruction ,
Halton sequence ,
Uniform random matrix

6.4. Simplified user interface¶

There are other DOE generators out there. But they generate n-dimensional arrays. doepy is built on the simple theme of being intuitive and easy to work with - for researchers, engineers, and social scientists of all background - not just the ones who can code.

User just needs to provide a simple CSV file with a single table of variables and their ranges (2-level i.e. min/max or 3-level).

Some of the functions work with 2-level min/max range while some others need 3-level ranges from the user (low-mid-high). Intelligence is built into the code to handle the case if the range input is not appropriate and to generate levels by simple linear interpolation from the given input.

The code will generate the DOE as per user’s choice and write the matrix in a CSV file on to the disk.

In this way, the only API user needs to be exposed to, are input and output CSV files. These files then can be used in any engineering simulator, software, process-control module, or fed into process equipment.

This document is derived from another project acknowledged at end of the documentation.

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	290	0.2	5
70	290	0.2	5
40	320	0.2	5
…	…	…	…
…	…	…	…
40	290	0.3	8
50	290	0.3	8
70	290	0.3	8
40	320	0.3	8
…	…	…	…
…	…	…	…

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	290	0.2	5
70	290	0.2	5
40	320	0.2	5
…	…	…	…
…	…	…	…
40	290	0.3	8
50	290	0.3	8
70	290	0.3	8
40	320	0.3	8
…	…	…	…
…	…	…	…

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	290	0.2	5
70	290	0.2	5
40	320	0.2	5
…	…	…	…
…	…	…	…
40	290	0.3	8
50	290	0.3	8
70	290	0.3	8
40	320	0.3	8
…	…	…	…
…	…	…	…