Product Aesthetic Design: A Machine Learning Augmentation

Published Online: https://doi.org/10.1287/mksc.2022.1429

Abstract

Aesthetics are critically important to market acceptance. In the automotive industry, an improved aesthetic design can boost sales by 30% or more. Firms invest heavily in designing and testing aesthetics. A single automotive “theme clinic” can cost more than $100,000, and hundreds are conducted annually. We propose a model to augment the commonly used aesthetic design process by predicting aesthetic scores and automatically generating innovative and appealing product designs. The model combines a probabilistic variational autoencoder (VAE) with adversarial components from generative adversarial networks (GAN) and a supervised learning component. We train and evaluate the model with data from an automotive partner—images of 203 SUVs evaluated by targeted consumers and 180,000 high-quality unrated images. Our model predicts well the appeal of new aesthetic designs—43.5% improvement relative to a uniform baseline and substantial improvement over conventional machine learning models and pretrained deep neural networks. New automotive designs are generated in a controllable manner for use by design teams. We empirically verify that automatically generated designs are (1) appealing to consumers and (2) resemble designs that were introduced to the market five years after our data were collected. We provide an additional proof-of-concept application using open-source images of dining room chairs.

History: Puneet Manchanda served as the senior editor.

Funding: A. Burnap received support from General Motors to partially fund a postdoctoral research position for the research conducted in this work. He certifies that none of the research or its results were censored or obfuscated in its publication. J. Hauser and A. Timoshenko certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Supplemental Material: The data files are available at https://doi.org/10.1287/mksc.2022.1429.

1. Introduction

Consumers consistently rank aesthetics among the three most important factors in product choice (Bloch 1995, Creusen and Schoormans 2005). For example, the visual design of the original iPod was judged to be a critical factor in its market acceptance (Reppel et al. 2006). In categories such as home appliances, aesthetics help firms establish product differentiation beyond functional characteristics (Bloch 1995, Crilly et al. 2004, Person et al. 2007); for instance, the Dyson DC01 used transparent design to communicate its complexity to consumers, helping it become the best-selling vacuum in the United Kingdom (Noble and Kumar 2010). Firms use aesthetics to strategically position products and enhance brand recognition (Aaker and Keller 1990, Keller 2003, Karjalainen and Snelders 2010). Violations of trade dress (nonfunctional attributes that signal brand identity) are hotly contested in Lanham Act (§43A) litigation. Aesthetics pervade marketing—visually appealing products and packaging drive consumers to choose one product over another, especially at the point of purchase in crowded brick-and-mortar stores, supermarkets, and online retailers (Clement 2007, Orth and Malkewitz 2008).

Developing product aesthetics can require substantial investment, yet returns on investment are found across markets—a study of 93 firms across nine product categories found that firms that heavily invested in aesthetic design had 32% higher earnings than industry averages (Hertenstein et al. 2005). Marketing and product managers routinely manage the aesthetic design of products, services, and digital marketplaces. In this paper, we propose a methodology to improve the process of aesthetic product design and testing. The basic concepts of the proposed methodology are applicable across product categories. Our research focuses on the automotive industry where we have the most experience and an industry partner; we provide an additional proof-of-concept application using publicly available data on furniture.

In the automotive industry, product aesthetics explain up to 60% of consumers’ purchase decisions (Kreuzbauer and Malter 2005). Automotive design significantly affects market performance (Cho et al. 2015, Rubera 2015, Jindal et al. 2016), in large part by influencing consumer consideration (Palazzolo and Feinberg 2015, Liu et al. 2017). For example, the redesign of the 2008 Buick Enclave commanded a 30% increase in manufacturer’s suggested retail price over the Buick Rendezvous it replaced (using the same engine; Figure 1); the redesign of the 2005 Volkswagen Beetle resulted in a 54% market share gain in a single year (Kreuzbauer and Malter 2005, Blonigen et al. 2013). Conversely, the aesthetics of the 2001 Pontiac Aztek were cited as a primary reason for its market failure (Vlasic 2011). Not surprisingly, automotive firms invest heavily in design—$1.25 billion on average per model, and up to $3 billion for major redesigns involving both styling and platform (Pauwels et al. 2004, Blonigen et al. 2013, Rubera 2015).

Figure 1. (Color online) Three Otherwise Similar Automobiles with Different Aesthetic Design

Traditionally, human judgment drives the aesthetic design in at least two ways. First, although there are established aesthetic heuristics and cognitive design principles (Coates 2003, Crilly et al. 2004, Norman 2004), aesthetic design is often generated and screened by design teams who have an “eye” for visual design. Design teams are powerful within organizations; their aesthetic judgments are hard to overrule (Vlasic 2011).

Human judgment also affects aesthetic design through consumer evaluations. Firms often ask consumers to evaluate alternative designs in laboratory test markets, A/B tests, or “theme clinics.” In a typical automotive theme clinic, a few hundred targeted consumers are recruited and brought to a central location to evaluate aesthetic designs. Consumers view the aesthetic designs and rate them on established benchmarks such as semantic scales for sporty, appealing, innovative, and luxurious (Coates 2003). Theme clinics are costly. Automotive firms typically invest more than $100,000 per theme clinic for a single new vehicle design. With multiple aesthetic designs per vehicle and over a hundred vehicles in its worldwide product line, General Motors alone spends tens of millions of dollars on theme clinics. With additional costs incurred when designers screen hundreds of aesthetic designs down to those destined for theme clinics, annual costs can exceed $100 million for a single manufacturer.

We propose methods to augment the traditional product development process with machine learning tools to address both aspects of aesthetic design: (1) the generation and (2) the testing of new aesthetic designs. For testing, the model predicts how consumers would rate aesthetic designs directly from visual images. We demonstrate that the model can predict different semantic scales, such as aesthetic appeal, innovativeness, or traditional versus modern. Specifically, we use an encoding model to represent visual designs (images) using 512-dimensional embeddings and train a predictive model that predicts aesthetic ratings based on those embeddings. The predictive model is designed to screen newly proposed aesthetic designs so that only the highest-potential designs need to be tested in theme clinics (or their equivalent for nonautomotive applications).

For generation, the generative model creates new product designs with attributes defined by the design team (e.g., “Cadillac-like”). This gives the designers a tool to morph through the design space and explore visual dimensions of consumer aesthetic perceptions. These generated designs can be rated by the predictive model to identify those with high aesthetic scores. In the automotive proof-of-concept, we demonstrate that generated images are controllable, realistic, and perceived by consumers as aesthetically appealing. Moreover, the model can be “creative”: When trained using model year 2010–2014 data, the model can generate automotive designs similar to those introduced in model year 2020.

Our research was influenced by senior design and marketing managers in the automotive industry interested in using machine learning methods to improve aesthetic product design. These managers suggested that we focus on augmentation of human expertise and creativity in the traditional aesthetic design workflow rather than automation. Design teams welcome augmentation, but organizational structure, history, and designers’ beliefs and training resist full automation (Coates 2003). Our experiences with real design teams guide the modeling decisions in the proposed approach.

2. Conceptual Model of Product Aesthetic Design

2.1. Augmenting the Design Process

Figure 2 summarizes the current widely used design process and the proposed machine learning augmentation. Consider first the current process as shown in the first two rows (coded as black online). The process begins with a market definition that is external to the aesthetic design efforts. For example, Apple targeted smartphones with a touchscreen; Zenni Optical aimed to develop prescription sports glasses; and IKEA targeted affordable yet aesthetically pleasing furniture. In automotive markets, firms target particular segments such as luxury compact utility vehicles (currently the Cadillac XT5, Buick Envision, Volvo XC60, BMW X3, and others). Market definition provides soft constraints on aesthetics based on the targeted consumers and the firm’s capabilities (Box 1).

Figure 2. (Color online) Augmenting Aesthetic Design with Machine Learning

Aesthetic designers create hundreds to thousands of freehand sketches that are converted to two-dimensional (2D) images (Box 2; Coates 2003). For example, Dyson and General Motors generate several hundred sketches per new device or vehicle, whereas IKEA can generate fewer sketches given its product line variety and turnover (Bouchard et al. 2006, Toffoletto 2013). The human design team next screens potential designs to a smaller set of testable design concepts in a process known in design as “down-selection” (Box 3). Consumers evaluate the testable designs in theme clinics resulting in more screening. Successful designs are advanced downstream for further development, including engineering, manufacturing, and marketing communications (advertising, social media, websites). The process is highly iterative and asynchronous across both design concept generation and testing.

The trapezoids and arrows (double-lines online) highlight the proposed machine learning augmentation. The machine learning models augment the traditional design process and apply to all iterations in concept generation and testing. In testing, the predictive model helps eliminate designs likely to score low in theme clinics. Focusing on high-potential product images improves the traditional design workflow in several ways. Accurate prediction provides quick feedback and enables faster iterations by the design team. Theme clinics also become more efficient and effective because less respondent time is allocated to images with low predicted scores (Gross 1972). As a result, firms benefit from shorter product development times and cost reductions due to a lower product “drop rate,” that is, the share of design concepts later terminated in downstream stages (Cooper 1990, Danneels and Kleinschmidt 2001). Finally, rigorous quantitative evaluation helps “shield” aesthetic designs from downstream changes driven by engineering, manufacturing, or accounting (Hartley 1996, Vlasic 2011).

The generative model creates designs that are realistic and screened by the predictive model to be aesthetically pleasing. Generated designs are intended to spark creativity among human designers, who can use the model as a tool to explore the space of possible aesthetic designs (Martindale 1990). The designers “control” the generative model through specifying attributes. Example attributes are red, Cadillac-like, sport utility vehicle (SUV), 2015 vintage, or viewed from the side. The vintage variable is important when training the model in our empirical application. It captures the evolving aesthetics in the automotive industry (Martindale 1990, Hekkert et al. 2003).

2.2. Technical and Managerial Challenges in Augmenting Aesthetic Design

The efficient augmentation of the traditional aesthetic design pipeline with machine learning tools requires that we address technical and managerial challenges. First, images pose a technical challenge because they are inherently high dimensional. Even modest-quality images are 100 × 100 pixels for each of the red, green, and blue channels, together comprising 30,000 variables—far too many to be input to conventional choice models. Previous work has represented aesthetics in choice models using hand-engineered features such as characteristic lines (Ranscombe et al. 2012, Chan et al. 2018), landmark points (Landwehr et al. 2011), silhouettes (Orsborn et al. 2009, Reid et al. 2010), and Bezier curves (Kang et al. 2019) or aggregated numbers such as J.D. Power APEAL or online reviews (Pauwels et al. 2004, Cho et al. 2015, Homburg et al. 2015, Jindal et al. 2016). Despite the challenges of working with images, we follow the industry standard of providing realism to designers, who think naturally in terms of holistic images. Images are, to designers and consumers, realistic representations of new product aesthetics.1

Second, gathering consumer evaluations is costly and results in limited training data. In our automotive application, we are fortunate to have 7,308 aesthetic ratings by consumers for 203 vehicles, but those ratings alone would be insufficient to estimate a predictive and/or generative model with high-dimensional image data.

We address these practical challenges by training an encoding model to embed images in a lower-dimensional vector space. Embeddings reduce the dimensionality of the images for the predictive and generative models by combining the relatively thin and expensive labeled training data (images with aesthetic ratings) with a much larger sample of unlabeled training data (180,000 images without consumer evaluations).2 Success depends on whether the embeddings compress the important information from the full images while allowing us to predict human aesthetic judgments and generate perceptually realistic new designs.

Embeddings using a neural network have seen recent adoption in marketing science. For example, Timoshenko and Hauser (2019) embed textual data to identify consumer needs; Liu et al. (2019) embed product reviews to predict sales conversion; Liu et al. (2020) embed social media images to predict identity; Dew et al. (2022) embed firms’ logos to describe brand personality and similarity; Gabel and Timoshenko (2022) and Feldman et al. (2022) embed purchase histories to predict product choice in retail; and Chakraborty et al. (2022) embed reviews to identify sentiment and missing evaluations. In this paper, we embed product images to predict aesthetic ratings and generate new aesthetic designs.

Third, aesthetic evaluations are holistic (Berlyne 1971, Martindale 1990, Bloch 1995, Crilly et al. 2004). Design aspects are interdependent; we cannot expect consumers to evaluate the design aspects separately as would be done in conjoint analysis (Orme and Chrzan 2017). For example, a consumer cannot evaluate the aesthetics of a new BMW X3 design as an additive sum of the shape and position of headlights, the slope of the hood, and the height of the beltline. Rather, the Gestalt interplay of all design elements, including subtle elements such as the “Hofmeister kink,” drives consumer evaluations of qualitative attributes such as appealing, sporty, aggressive, luxurious, or modern (Coates 2003). By using deep neural networks for the encoding, predictive, and generative models, we automatically and holistically capture the interplay of aesthetic elements.

Fourth, the aesthetic design process is highly iterative, asynchronous, and distributed. This poses a significant managerial challenge—multiple design teams (and subteams) must be able to use the machine learning augmentation for their corresponding roles in design concept testing and generation. Integrating machine learning into the existing aesthetic design process must delicately balance its interplay with the established workflows. For example, the design team may split into subteams to work on several promising design concepts in parallel, whereas concurrent theme clinics may be testing entirely different design concepts. To enhance parallel development, once trained, our proposed predictive and generative models can be used independently or together as needed.

3. Overview of a Machine Learning Approach to Augment Product Aesthetic Design

Let $X_i$ be the (height × width × color) three-dimensional (3D) tensor of pixels of image $i$. Our model requires two inputs: product images labeled with aesthetic ratings evaluated by consumers, $\{(X_i, y_i)\,|\,i = 1, \dots, N_L\}$, and unlabeled product images without ratings but with product attributes, $\{(X_i, a_i)\,|\,i = 1, \dots, N_U\}$. For example, $y_i$ might be the average consumer rating of the aesthetic appeal of product design $i$, and the attributes $a_i$ may describe primary exterior color, body type, model year, or brand. Firms typically obtain consumer evaluations for a small fraction of images, $N_L \ll N_U$. Attributes are important for managerial acceptance because they enable the design team to control the design. However, the proposed model does not require that all attributes are available for every image. The model imputes any missing attributes during inference (Section 4).

Our two high-level goals are (1) to test new product aesthetics by predicting consumer ratings, $\hat{y}_{\text{new}}$, for new product images created by the design teams, $X_{\text{new}}$, and (2) to generate new product designs, $\hat{X}_{\text{gen}}$, according to attributes desired by the design team, $a_{\text{gen}}$, such that the generated images score well on the predicted ratings, $\hat{y}_{\text{gen}}$. We summarize notation in the appendix, Section A.1.

Figure 3 illustrates the general input and output flow of the proposed machine learning augmentation for aesthetic design. For every aesthetic design $i$, the encoding model inputs the image, $X_i$, and/or attributes, $a_i$, and outputs a 512-dimensional embedding distribution $q(h\,|\,X_i, a_i)$. We sample the embedding vector $h_i$ from the distribution $q(h\,|\,X_i, a_i)$. The predictive model uses $h_i$ to predict aesthetic scores, $\hat{y}_i = p_{\text{pred}}(y\,|\,h_i)$. When predicting the aesthetic rating for a new image $X_{\text{new}}$, we use the encoder to obtain an embedding distribution $q_{\text{new}}(h) = q(h\,|\,X_{\text{new}}, \cdot)$, and we average predictions over multiple draws of the embedding vectors, $\hat{y}_{\text{new}} \approx E_{h_i \sim q_{\text{new}}(h)}[\,p_{\text{pred}}(y\,|\,h_i)\,]$. The generative model creates images conditional on the embedding $h_i$, meaning $\hat{X}_i = p_{\text{gen}}(X\,|\,h_i)$. When generating new aesthetic designs, we input a desired attribute vector $a_{\text{gen}}$ into the encoder to obtain the distribution $q_{\text{gen}}(h) = q(h\,|\,\cdot, a_{\text{gen}})$, sample embeddings $h_r \sim q_{\text{gen}}(h)$, and generate images $\hat{X}_{\text{gen}} = p_{\text{gen}}(X\,|\,h_r)$ for embeddings with high predicted appeal $p_{\text{pred}}(y\,|\,h_r)$.
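To make this flow concrete, the following sketch shows how the three trained components could be chained for prediction and for attribute-conditioned generation. It is a minimal PyTorch-style illustration, not the authors' released code: the module handles encoder, predictor, and generator, their signatures, and the number of draws are all assumptions.

```python
import torch

# Hypothetical trained modules (names and signatures are assumptions):
#   encoder(X, a)  -> (mu, log_var) of the 512-dimensional embedding distribution q(h | X, a)
#   predictor(h)   -> predicted aesthetic rating
#   generator(h)   -> decoded product image (and mask channel)

def predict_rating(encoder, predictor, X_new, a_new=None, n_draws=10):
    """Average predicted ratings over several draws from q(h | X_new, a_new)."""
    mu, log_var = encoder(X_new, a_new)
    draws = []
    for _ in range(n_draws):
        eps = torch.randn_like(mu)
        h = mu + torch.exp(0.5 * log_var) * eps          # reparameterized sample h ~ q(h | X, a)
        draws.append(predictor(h))
    return torch.stack(draws).mean(dim=0)                # Monte Carlo estimate of E[p_pred(y | h)]

def generate_designs(encoder, generator, predictor, a_gen, n_samples=25):
    """Sample embeddings conditioned on desired attributes; rank generated designs by predicted appeal."""
    mu, log_var = encoder(None, a_gen)                   # attribute-only conditioning q(h | ., a_gen)
    eps = torch.randn(n_samples, mu.shape[-1])
    h = mu + torch.exp(0.5 * log_var) * eps              # h_r ~ q_gen(h)
    scores = predictor(h).squeeze(-1)                    # predicted aesthetic appeal for each draw
    images = generator(h)                                # decoded 512 x 512 designs
    order = torch.argsort(scores, descending=True)
    return images[order], scores[order]
```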

Figure 3. (Color online) Proposed Machine Learning Augmentation Model

The three models in the proposed approach—the generative model, encoding model, and predictive model—are connected by the probabilistic embedding. We learn an embedding distribution for each product design rather than a point estimate. Each aesthetic design $i$ (from designers or automatically generated) has its own embedding distribution $h_i \sim q(h\,|\,X_i, a_i)$. Learning the parameters of the distribution leverages the variational inference literature (Jordan et al. 1999, Blei et al. 2017), enabling Bayesian parameter estimation at data sizes otherwise intractable for Markov chain Monte Carlo (MCMC) sampling.

The shared probabilistic embedding helps to effectively leverage big unlabeled data in training the predictive model. Intuitively, the labeled data from theme clinics alone are too thin to learn a mapping from high-dimensional product images to the aesthetic ratings. The unlabeled data contain information about the product itself (e.g., automobile images have four wheels). Our model relies on this information to train the probabilistic encoder, which then makes the prediction problem feasible with thin data.

We use deep neural networks to parametrize all three models. We combine and jointly minimize loss functions for the three deep neural networks using both labeled and unlabeled images. To be usable by real design teams, the generative model must produce images perceived as realistic. We gain realism with adversarial training. Adversarial training requires that the generator create images the encoder perceives as real, whereas the encoder seeks to distinguish real from generated images. The resulting minmax equilibrium leads to realistic images. For the remainder of the paper, we refer to the predictive model, generative model, and encoding model also as the predictor, generator, and encoder, respectively.

4. Proposed Approach: Semi-supervised Variational Autoencoder with Adversarial Terms

We denote the parameters of the predictive model by $\beta_P$, the generative model by $\beta_G$, the encoding model by $\beta_E$, and the combined vector of parameters by $\beta = (\beta_P, \beta_G, \beta_E)$. To train the model, we minimize the combined loss function:

$$\mathcal{L}(\beta) = \mathcal{L}_{\text{pred}}(\beta_P) + \mathcal{L}_{\text{gen}}(\beta_G) + \mathcal{L}_{\text{enc}}(\beta_E), \qquad (1)$$
where $\mathcal{L}_{\text{pred}}(\beta_P)$, $\mathcal{L}_{\text{gen}}(\beta_G)$, and $\mathcal{L}_{\text{enc}}(\beta_E)$ indicate the predictive, generative, and encoding loss functions, respectively. The summation of the loss terms is theoretically justified by the law of conditional probability and the principles of approximate marginalization of the likelihood formulation of the loss functions. However, we weight the various loss functions when training the model. We provide details for the probabilistic formulation of the VAE and for separability of the loss function in the appendix, Section A.2.

4.1. Loss Terms for the Predictive, Generative, and Encoding Models

4.1.1. Predictive Model.

We use a deep neural network, $f_P(h_i, \beta_P)$, to map embeddings to the rating of interest, say a one-to-five rating on aesthetic “appeal.” Information about the full images and attributes is summarized in the embeddings. For the predictive model, we define $\hat{y}_i = f_P(h_i, \beta_P)$ as the predicted rating from the neural network; the loss term minimizes the mean absolute error of predicted versus true ratings. The mean absolute error definition was motivated by our industry application. Our automotive partner traditionally considers the mean absolute error in their analysis of aesthetic ratings. This loss function is consistent with an assumption that the observed ratings are drawn from a Laplace distribution with mean $f_P(h_i, \beta_P)$ and unit diversity in the probabilistic formulation (see the appendix, Section A.2).

$$\mathcal{L}_{\text{pred}}(\beta_P) = \sum_{i \in \text{rated}} |y_i - \hat{y}_i| \qquad (2)$$
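A minimal sketch of this loss term, assuming predicted and observed mean ratings are available as tensors (variable names are illustrative):

```python
import torch.nn.functional as F

def predictive_loss(y_hat, y_rated):
    """Equation (2): sum of absolute errors between predicted and observed mean ratings,
    computed over the rated (labeled) images only."""
    return F.l1_loss(y_hat, y_rated, reduction="sum")
```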

4.1.2. Generative Model.

We use a deep neural network, $f_G(h_i, \beta_G)$, to generate an image $\hat{X}_i$ from an embedding $h_i$. The loss function for the generative model combines two terms. The first term rewards the quality of image reconstruction. Given an embedding $h_i$ for an image $X_i$ (labeled or unlabeled), we want the generative model to produce an image $\hat{X}_i$ that is similar to the original image $X_i$. This assures that the generated images are “vehicle-like.”

We use a second term to enhance the generative model using “masks.” A mask defines the general shape of a product, say “SUV-like.” Masks are matrices with binary values of the same height and width as the product images. The $D$ pixels of the mask, $M_i$, indicate the presence of the product in the image. We use standard computer vision tools to create masks for all labeled and unlabeled images and show an example mask in the appendix, Section A.4. Masks focus the generative model on product designs rather than unrelated information in product images. In the generator, masks are analogous to a fourth color (red, green, blue, mask) and predicted by the same deep neural network using the (now-augmented) parameters, $\beta_G$.

Although the generative model is used to generate new designs, it is trained on existing images. As detailed in the appendix, Section A.2, the mean absolute error loss function is consistent with the images (and masks) being drawn from a high-dimensional Laplace distribution with mean $f_G(h_r, \beta_G)$ and unit diversity. If $d$ indexes the now-4D pixels, then the loss function for the generative model becomes

$$\mathcal{L}_{\text{gen}}(\beta_G) = \sum_{i \in \text{rated, unrated}} \left\{ \frac{1}{3D} \sum_d |x_{id} - \hat{x}_{id}| + \frac{1}{D} \sum_d |m_{id} - \hat{m}_{id}| \right\}. \qquad (3)$$
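A sketch of Equation (3) in PyTorch, assuming the generator outputs a four-channel tensor whose last channel is the predicted mask (tensor shapes and names are assumptions):

```python
def generative_loss(generator, h, x_true, m_true):
    """Equation (3): L1 reconstruction of the RGB image plus L1 reconstruction of the mask.

    x_true: (B, 3, H, W) images; m_true: (B, 1, H, W) binary masks; D = H * W pixels.
    """
    out = generator(h)                               # (B, 4, H, W): red, green, blue, mask
    x_hat, m_hat = out[:, :3], out[:, 3:4]
    D = m_true.shape[-2] * m_true.shape[-1]
    image_term = (x_true - x_hat).abs().sum(dim=(1, 2, 3)) / (3 * D)
    mask_term = (m_true - m_hat).abs().sum(dim=(1, 2, 3)) / D
    return (image_term + mask_term).sum()            # summed over rated and unrated images
```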

4.1.3. Encoding Model.

We use a deep neural network, $f_E(X_i, a_i, \beta_E)$, to map images, $X_i$, and product attributes, $a_i$, to an embedding distribution, $\hat{q}_{\text{enc}}(h\,|\,X_i, a_i)$. We assume a $K$-dimensional Gaussian distribution, $\hat{q}_{\text{enc}}(h\,|\,X_i, a_i)$, with mean, $\hat{\mu}_i(X_i, a_i)$, and a diagonal covariance, $\hat{\sigma}_i(X_i, a_i)$. The neural network, $f_E(X_i, a_i, \beta_E)$, estimates the distributional parameters, $\hat{\mu}_i(X_i, a_i)$ and $\hat{\sigma}_i(X_i, a_i)$, for the image $X_i$ and attributes $a_i$.

The encoder loss function includes two terms. The first loss term rewards the encoder for estimating $\hat{q}(h\,|\,X_i, a_i)$ that is close to the prior. This term acts to regularize the embedding and prevents the encoder from “cheating” by assigning each image to a unique subregion of the embedding space. If the encoder were to “cheat,” it would effectively memorize the training data at the expense of generalizable performance. We use a standard normal distribution as a prior and measure the distance between distributions by the Kullback-Leibler (KL) divergence (Kingma and Welling 2013).

In practical applications, product attributes $a_i$ often contain missing values. The second term in the encoder imputes the missing values by estimating a multinomial classifier. The multinomial classifier is consistent with assuming a Dirichlet distribution of product attributes $a_i$ in the probabilistic formulation (see the appendix, Section A.2). For image $X_i$, the encoder neural network produces a probability, $\hat{a}_{ic\ell}$, that image $X_i$ has attribute value $a_{ic\ell}$ for each level, $\ell$, of each attribute, $c$. For example, if Cadillac is a level ($\ell$) of the brand attribute ($c$), then $\hat{a}_{ic\ell}$ is the probability that image $X_i$ is a Cadillac.

Putting these ideas together, the encoder loss function becomes

$$\mathcal{L}_{\text{enc}}(\beta_E) = \sum_{i \in \text{rated, unrated}} \left\{ D_{KL}\!\left( q_{\text{enc}}(h\,|\,X_i, a_i) \,\|\, N(0, I) \right) - \sum_{c=1}^{C} \sum_{\ell=1}^{L_c} a_{ic\ell} \log \hat{a}_{ic\ell} \right\} = \sum_{i \in \text{rated, unrated}} \left\{ \sum_{k=1}^{K} \left[ \tfrac{1}{2}\left( \mu_{ik}^2 + \sigma_{ik}^2 \right) - \log \sigma_{ik} \right] - \sum_{c=1}^{C} \sum_{\ell=1}^{L_c} a_{ic\ell} \log \hat{a}_{ic\ell} \right\}, \qquad (4)$$
where $D_{KL}$ indicates the KL divergence, $N(0, I)$ is the standard normal prior, $L_c$ is the number of levels of attribute $c$, and $k$ indexes the embedding dimensions. The diagonal covariance structure of $\hat{\sigma}_i(X_i, a_i)$ and the standard normal prior provide a simple representation for the first term in the encoding model (Kingma and Welling 2013).
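A sketch of Equation (4), assuming the encoder returns the Gaussian parameters and one logit vector per categorical attribute (interfaces and shapes are assumptions; constants that do not affect optimization are dropped, as in the equation above):

```python
import torch
import torch.nn.functional as F

def encoder_loss(mu, log_var, attr_logits, attr_targets):
    """Equation (4): KL divergence to the N(0, I) prior plus attribute cross-entropy.

    mu, log_var: (B, K) parameters of the diagonal Gaussian q_enc(h | X_i, a_i).
    attr_logits: list of (B, L_c) logit tensors, one per categorical attribute c.
    attr_targets: list of (B,) integer level indices for each attribute.
    """
    sigma = torch.exp(0.5 * log_var)
    kl = (0.5 * (mu.pow(2) + sigma.pow(2)) - torch.log(sigma)).sum(dim=1)
    ce = sum(F.cross_entropy(logits, target, reduction="none")
             for logits, target in zip(attr_logits, attr_targets))
    return (kl + ce).sum()                           # summed over rated and unrated images
```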

4.2. Modification with Adversarial Terms

Effective machine learning augmentation requires that the model encodes and generates images well. For example, if we are generating luxury crossover utility vehicles (luxury CUVs), the images should look like well-designed luxury CUVs. After extensive experimentation and tuning, we found it necessary to augment the VAE formulation presented in Section 4.1 using the concept of adversarial training found in generative adversarial networks (GANs).

The basic idea is that we reward the generative model for generating images with embedding distributions similar to the prior, and we reward the encoder for encoding the generated images with distributions distant from the prior. To achieve these joint goals, we implement competing adversarial objectives—a term in the generator is the negative of a term in the encoder, similar to Heljakka et al. (2019). We train the generator and encoder iteratively, so that the generator and encoder reach a minmax solution to a two-player game. That is, iterative training converges to a fixed point where generated images and actual images are both encoded to the same embedding space (Goodfellow et al. 2014, Ulyanov et al. 2018). Iterative training assures the adversarial terms do not simply cancel out:

$$\mathcal{L}_{\text{adv}}(\beta_E) = -\sum_{g \in \text{generated}} D_{KL}\!\left( q_{\text{enc}}(h\,|\,X_g, a_g) \,\|\, N(0, I) \right) = -\sum_{g \in \text{generated}} \left\{ \sum_{k=1}^{K} \left[ \tfrac{1}{2}\left( \mu_{gk}^2 + \sigma_{gk}^2 \right) - \log \sigma_{gk} \right] \right\}. \qquad (5)$$
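The following sketch pairs the two competing terms: the generator's adversarial loss is the KL divergence of generated images to the prior, the encoder's adversarial loss is its negative, and the two are applied in alternating updates rather than summed in a single step (module interfaces are assumptions):

```python
import torch

def kl_to_prior(mu, log_var):
    """Per-image sum over k of [ 0.5 * (mu_k^2 + sigma_k^2) - log sigma_k ]."""
    sigma = torch.exp(0.5 * log_var)
    return (0.5 * (mu.pow(2) + sigma.pow(2)) - torch.log(sigma)).sum(dim=1)

def adversarial_terms(encoder, generator, h_prior, a_gen):
    """Equation (5) and its generator-side mirror, evaluated on generated images."""
    x_gen = generator(h_prior)                       # images generated from prior samples
    mu_g, log_var_g = encoder(x_gen, a_gen)
    kl_g = kl_to_prior(mu_g, log_var_g).sum()
    generator_adv = kl_g                             # generator wants generated images near the prior
    encoder_adv = -kl_g                              # encoder wants generated images far from the prior
    return generator_adv, encoder_adv
```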

This approach to adversarial training differs from conventional GANs and VAE-GAN hybrids such as adversarial autoencoders in that we are not learning an implicit generative model (i.e., a likelihood-free generative model specification) by rewarding a “discriminator” to classify images as real or generated (Pan et al. 2017). We instead perform adversarial training in the embedding space much like feature matching and perceptual similarity approaches (Larsen et al. 2015). Our model is an explicit generative model (i.e., parametric assumptions of the embedding distribution) that ultimately aligns with our managerial use case—a smooth and controllable embedding that allows “creative” exploration by product designers.

4.3. Summary

The components of the loss terms described in Section 4.1 follow the VAE perspective. These components may be viewed as combining semi-supervised VAEs (Kingma et al. 2014) with conditional VAEs (Sohn et al. 2015). We augment the VAE perspective with adversarial methods (similar to GANs) to encourage the generator to produce realistic new images. In contrast to typical VAE-GAN approaches (Larsen et al. 2015, Zhao et al. 2016, Berthelot et al. 2017), adversarial autoencoders (Makhzani et al. 2015), and adversarial generative encoders (Heljakka et al. 2018, Ulyanov et al. 2018), we retain the probabilistic interpretation of the combined model. The proposed approach enables us to sample from the distribution implied by the generator, improves embeddings, and minimizes “posterior collapse” in VAEs.

Table 1 summarizes the loss functions and indicates which images are included in summations. The loss terms are weighted in summation to balance the quality of predictive and generative tasks and to improve model convergence (Section 5).

Table 1. Predictive, Generative, and Encoding Loss Functions

Loss function | Data | Intuition

Predictive model
$|y_i - \hat{y}_i|$ | Labeled | MAE term rewards the predictor for predicting ratings.

Generative model
$\frac{1}{3D} \sum_d |x_{id} - \hat{x}_{id}|$ | Labeled and unlabeled | MAE reconstruction term rewards the generator for generating images that are similar to the true images.
$\frac{1}{D} \sum_d |m_{id} - \hat{m}_{id}|$ | Labeled and unlabeled | MAE reconstruction term rewards the generator for predicting masks that correctly segment the design within the image.
$\sum_{k=1}^{K} \left[ \tfrac{1}{2}(\mu_{gk}^2 + \sigma_{gk}^2) - \log \sigma_{gk} \right]$ | Generated | Adversarial term rewards the generator for images with embedding distributions close to the prior. Summed over generated images only ($g$).

Encoding model (summed over rated, unrated images, or, when indicated, generated images)
$\sum_{k=1}^{K} \left[ \tfrac{1}{2}(\mu_{ik}^2 + \sigma_{ik}^2) - \log \sigma_{ik} \right]$ | Labeled and unlabeled | KL divergence rewards the encoder for embeddings close to the prior.
$-\sum_{c=1}^{C} \sum_{\ell=1}^{L_c} a_{ic\ell} \log \hat{a}_{ic\ell}$ | Labeled and unlabeled | Cross-entropy rewards the encoder for predicting attributes from images.
$-\sum_{k=1}^{K} \left[ \tfrac{1}{2}(\mu_{gk}^2 + \sigma_{gk}^2) - \log \sigma_{gk} \right]$ | Generated | Adversarial term rewards the encoder for encoding generated images with distributions far from the prior. Summed over generated images only ($g$).

5. Moving from Theory to Practical Implementation

Our proposed model differs from standard VAE approaches because we include information from product ratings and attributes, and we add masks and adversarial terms. These practical adjustments are necessary to enhance predictive ability and generate realistic new aesthetic designs; however, they create additional technical challenges in model training. We describe a proposed custom deep learning architecture and our approach to training the resulting model. Although the architecture and model hyperparameters are specific to our application, the principles behind the modeling and tuning decisions are general. Once trained, our model is rapid and easy to use. We provide a proof-of-concept application with SUVs in Sections 6 and 7 and discuss an additional proof-of-concept application with dining room chairs in Section 8.

5.1. Deep Learning Architecture

Figure 4 summarizes the deep neural network architectures for the predictive, generative, and encoding models. To simplify the description of the architectures, Figure 4 uses “blocks” of neural network layers, in which each “block” is made up of several neural network layers as described at the bottom of Figure 4. Each neural network layer (e.g., 2D convolution) performs the indicated operations on the outputs from the previous layers.

Figure 4. Deep Neural Network Architectures for Predictive, Generative, and Encoding Models

Moving from the left layer to the right, the 2D convolution layer takes image pixels (or the output of the previous layer) as input and sweeps over patches of pixels using a sliding window of trainable multiplicative weights. This helps the model learn spatial correlations among neighboring pixels (or the analogue in higher layers). The spectral normalization layer acts as a regularization technique to control the magnitude of gradients during model training, thereby stabilizing model training (see Section 5.2). A leaky rectified linear layer acts as a nonlinear activation function to transform values from the previous layer, enabling the neural network to learn complex nonlinear interactions. A residual connection simply adds the original input from the first layer of the block to the now-transformed features at the end of the block. In doing so, the intermediary layers learn the residual error from the previous block (Hu et al. 2018). 2D average pooling reduces the dimensionality of the previous layer by down-sampling patches of 2D features to a single value. Squeeze-and-excite explicitly models interdependencies across channels (e.g., the red, green, blue (RGB) channels for the input layer), performing “self-attention.” The appendix, Section A.15, provides brief definitions of technical terms used in this paragraph and elsewhere in this paper.
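As an illustration of one such "block," the sketch below stacks a spectral-normalized 2D convolution, a leaky rectified linear activation, a residual connection, 2D average pooling, and squeeze-and-excite. The layer ordering and channel sizes are assumptions for exposition, not the exact architecture in Figure 4.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel self-attention: squeeze spatial dims to per-channel statistics, then re-weight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        weights = self.fc(x.mean(dim=(2, 3)))                  # (B, C) channel weights
        return x * weights.unsqueeze(-1).unsqueeze(-1)

class EncoderBlock(nn.Module):
    """One illustrative encoder 'block': conv + spectral norm + leaky ReLU,
    residual connection, 2D average pooling, squeeze-and-excite."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
        self.act = nn.LeakyReLU(0.2)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)    # match channels for the residual add
        self.pool = nn.AvgPool2d(2)                            # average pooling halves resolution
        self.se = SqueezeExcite(out_ch)

    def forward(self, x):
        out = self.act(self.conv(x)) + self.skip(x)            # residual: add block input back
        return self.se(self.pool(out))
```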

When generating a new image, attributes are run through layers to obtain the $\hat{\mu}$'s and $\hat{\sigma}$'s, which, in turn, generate the predicted encoding $h_r$. Additional layers are not needed in the generator for existing images, hence the dotted box on the left side of the generative model. The predictive model does not have “blocks” but instead uses fully connected and rectified layers. Last, the custom deep learning model requires hyperparameter “tuning,” so we hold out data for model selection. We randomly split the data into training, validation, and testing sets using a seeded random number generator for reproducibility and statistics. Validation data were used to set model hyperparameters (e.g., learning rates) and monitor training progress for model selection. Testing data were used only for model evaluation.

5.2. Stabilization and Tuning of Model Training

We train the model using first-order optimizers (see the appendix, Section A.3). Deep learning models are often challenging to train because of the large numbers of parameters to be estimated. This is especially true for the models that include adversarial loss-function terms, as the adversarial components can increase training instability (Gulrajani et al. 2017). The appendix, Section A.7, provides two examples of training instability: gradient explosion and posterior collapse. Gradient explosion results in images unrecognizable as vehicles. Posterior collapse results in images that are all the same.

We apply several techniques to stabilize model training. We avoid unbounded “backpropagated” gradients that lead to catastrophic failures in model training (e.g., outputting images of only white pixels) by stabilizing training with the spectral normalization layer in each block. This helps to regularize the model by dividing the raw output weights of each layer by the largest singular value of the matrix of weights (Miyato et al. 2018). We also enforce both soft and hard constraints on the model architecture. Specifically, we bound the variance of the Gaussian random variables in the KL-divergence terms and use floating point safeguards to ameliorate the potential numerical instability introduced by logarithms and various $L_p$-norms.
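A minimal sketch of the variance bound and floating-point safeguards (the specific bounds are illustrative, not the values used in the paper); the spectral normalization itself appears in the block sketch above via nn.utils.spectral_norm:

```python
import torch

def stabilized_gaussian_params(raw_mu, raw_log_var, log_var_min=-6.0, log_var_max=4.0):
    """Bound the embedding variance (hard constraint) and guard the logarithms in the KL terms."""
    log_var = torch.clamp(raw_log_var, min=log_var_min, max=log_var_max)
    sigma = torch.exp(0.5 * log_var).clamp_min(1e-6)     # floating-point safeguard before log/KL
    return raw_mu, sigma
```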

We tune model training by scaling the contribution of each loss term in Table 1 with user-defined multiplicative weights. This creates a balance among the seven loss terms given their interdependency during training. These weights stabilize model training and are chosen by monitoring convergence metrics on the training data (e.g., gradient stability), predictive accuracy on the validation set, and the realism of the generated images during training. For example, we overcame a primary source of training instability that occurred when the KL-divergence of the generated images overwhelmed the KL-divergence of the observed images plus the reconstruction loss. To avoid this problem, we scaled the KL-divergence loss terms within the range of 1/10 to 1/20 relative to the loss terms for reconstructing images, and annealed these terms from zero at the start of training to a maximum value after model convergence. We further control the difference between the KL-divergence of observed images and the KL-divergence of generated images at each training iteration using a fixed margin (Huang et al. 2018).
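The sketch below shows a weighted combination of the loss terms with a KL-divergence weight annealed from zero; the linear schedule and the specific weight values are illustrative assumptions consistent with the 1/10 to 1/20 range mentioned above.

```python
def kl_weight(iteration, warmup_iters=100_000, max_weight=0.1):
    """Anneal the KL-divergence weight from zero toward its maximum (here, 1/10 of the
    reconstruction weight); the linear schedule is an assumption."""
    return max_weight * min(1.0, iteration / warmup_iters)

def combined_loss(loss_terms, weights):
    """Weighted sum of the seven loss terms in Table 1 with user-defined multipliers."""
    return sum(weights[name] * value for name, value in loss_terms.items())

# Example usage at training iteration t (term names are hypothetical):
# weights = {"pred": 1.0, "image": 1.0, "mask": 1.0, "kl": kl_weight(t),
#            "attr": 1.0, "adv_gen": kl_weight(t), "adv_enc": kl_weight(t)}
```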

Last, we use “progressive training” as proposed by Karras et al. (2017). In this approach, we begin training the model at the lowest image resolution in the data, 4 × 4 pixels, and progressively increase the resolution of input and output images, in stages, until we reach the desired image resolution of 512 × 512 pixels. We denote resolution stages by their height and width, for example, 4 × 4 pixels or 32 × 32 pixels. As one resolution stage of model training progresses into another, we anneal in the next (larger) image resolution by taking a linear combination of two images: an upsampled version of the lower-resolution image and the higher-resolution image. For each resolution stage, we train the model for multiple iterations, in which we smoothly increase the weight of the higher-resolution image in the linear combination. The number of iterations is a hyperparameter controlled using validation data (we trained for 1 million iterations per stage). The advantage of progressive training is that, rather than starting model training from a completely blank slate at full resolution, the model learns information about the aesthetic design at lower resolutions. Incremental learning improves the stability of model training and reduces training time. The appendix, Section A.5, provides examples of the generated images at different stages of progressive training.
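A sketch of the fade-in step used when moving from one resolution stage to the next (PyTorch-style; the interpolation mode is an assumption):

```python
import torch.nn.functional as F

def faded_image(image_lowres, image_highres, alpha):
    """Blend an upsampled lower-resolution image with the next-resolution image.
    alpha rises smoothly from 0 to 1 over the iterations of the new stage."""
    upsampled = F.interpolate(image_lowres, size=image_highres.shape[-2:], mode="nearest")
    return (1.0 - alpha) * upsampled + alpha * image_highres
```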

5.3. Computational Resources

Training the model takes roughly one to two weeks using a multi-GPU workstation (4 × Tesla V100 or 4 × Quadro RTX 8000), with major computational bottlenecks being both GPU clock speed and GPU memory. Because GPU memory is the main determinant of batch size, we opted for GPUs with either 32 or 48 GB of VRAM. We found that larger batch sizes aided training stability, particularly at the largest progressive training resolutions (e.g., batch size of 48 parallelized over 4 GPUs for resolution of 512 × 512).

We expect training times will decrease with subsequent applications and continual advances in machine learning and computational capabilities. Overall training time depends on a variety of factors including (1) the size of the model itself (the number of trainable parameters and in particular, the dimensionality of the embedding), (2) the number of iterations in progressive training stages, and (3) instabilities that require adjustments to the learning rates and the loss term scaling.

We tuned the model to balance training feasibility and the quality of the output. For example, we used the 512-dimensional embedding space in our automotive proof-of-concept. Fewer dimensions limited the model’s ability to encode the information from the images and capture nuances in the aesthetic designs. Larger dimensionality required more training time without additional benefits to the quality of the predictor or generator. We provide examples of the under- and overparameterized models in the appendix, Section A.8. These tuning decisions might be different for other industry applications.

The monetary costs of the proposed machine learning augmentation are low compared with the multimillion-dollar budgets allocated to improving product aesthetic designs (Section 1). We trained the model on a workstation that costs less than $10,000. After the model is trained, applying the model does not require machine learning expertise and can be done using standard corporate laptops. The data required for training the model are often routinely available within organizations.

5.4. Potential Model Extensions

Many applications of deep learning use a “pretrained model,” originally trained on a different task (e.g., object detection) and repurpose it for the desired task (e.g., aesthetic rating prediction) by retraining it on the desired task’s data. Prior marketing science research has applied pretrained models to image data for identifying product returns and brand identity (Dzyabura et al. 2018, Liu et al. 2020) and to textual data for identifying customer needs and predicting sales conversion (Liu et al. 2019, Timoshenko and Hauser 2019). However, pretrained models generally do not exist for generative models. This is particularly true if the generative model has domain-specific requirements, such as the managerial challenges in Section 2.2 of limited labeled data and ability to generate new aesthetically appealing images controllable by designers. The demands of interconnecting attributes, masks, ratings, high-resolution images, and adversarial training for generation required a custom architecture.

Our proposed model architecture supports extensions for applications to different industries and aesthetic design objectives. For simplicity of exposition, we presented a model trained for a single aesthetic rating; however, design teams are often interested in multiple metrics for aesthetics. For example, in the automotive industry, firms often consider aesthetic “appeal,” “sportiness,” and “innovativeness” among other aesthetic attributes and collect such ratings in theme clinics. We can extend our model to multiple aesthetic output measures by leveraging the shared embedding space produced by the encoder and calibrating separate predictor networks for each output measure (see Section 7.2).

Separability in the use of the predictor, generator, and encoder was motivated by the managerial challenge of fitting machine learning into the traditional design process. Separability is also beneficial in accommodating nonstationarity in customer preferences. For example, the aesthetic design of many of the cars popular in the 2000s may seem outdated in the 2020s. Because firms routinely conduct theme clinics, the predictive model can be regularly updated using more-recent data to account for nonstationarity in preferences. This can be done without retraining the encoder or generator: the updated predictor will prioritize different areas of the product space, while the “definition of the vehicle” in the encoder and generator stays unchanged.

Our model is widely applicable, but not without limit. Consider smartphones. The introduction of the iPhone changed the definition of aesthetics from the button-based phone to the touchscreen. Apple defined the look as a “black oily pond.” Adjusting for such a change in product definition requires retraining the entire machine learning augmentation. Fortunately, we can use the lessons learned in this initial application to improve tuning in future applications. We also expect tuning to get easier with further developments in hardware, deep learning frameworks, and transfer learning methods.

6. Case Study: Aesthetic Design of U.S. Automotive Market SUVs/CUVs

Our machine learning augmentation has two goals: (1) predict consumer evaluations of aesthetics for proposed designs and (2) generate innovative aesthetically appealing designs that spark creativity among the design team and design management. We calibrate and evaluate the proposed model using unique data provided by our automotive partner.

6.1. Data: Images Rated in Theme Clinics

We obtained aesthetic ratings for the unique images of 203 SUVs/CUVs from model years (MY) 2010–2014 tested in our partner’s theme clinics. Following established procedures that evolved over decades (and market research in general), the firm used screening questions to target consumers who intended to purchase in the target category in the near future (intenders), who were willing to evaluate aesthetics, and who were screened for sufficient attention and consistency. Respondents were incentive-aligned using standard methods, both fiscally and with knowledge their input would guide future aesthetics (Ding et al. 2011). The details of the screening questions and incentive alignment are proprietary to the firm.

A web-based survey was colocated with the theme clinics. Warm-up questions motivated respondents by explaining that their ratings would affect aesthetic design and introduced our partner’s previously calibrated pairwise-semantic-differential rating scale for aesthetic “appeal” (i.e., most unappealing to most appealing). The survey anchored respondents’ ratings by asking each respondent to rate five prechosen pairs of images—prechosen in pretests to be most divisive on the pairwise rating scale. The five image pairs were the same for all respondents, but randomly counterbalanced among respondents.

Each respondent rated ten sequential pages of SUVs/CUVs of five images per page. Each SUV/CUV image was presented from the side viewpoint. To test respondent consistency, the second and eighth page and the third and ninth page contained the same images randomly ordered. After extensive pretesting, the survey was implemented by our automotive partner. To maintain consistency among images and mitigate image-color biases, all images were reduced to greyscale and shown with a side viewpoint. The appendix, Section A.6, provides an example rating page. Although we cannot rule out a halo effect, it was unlikely—respondents rated many images and were unlikely to know the marketplace success of each vehicle.

To maintain data quality prior to any further analysis, we eliminated respondents who were judged to be inconsistent based on Krippendorff’s α, where α = 1 − (observed disagreement among like images)/(expected disagreement due to chance). Krippendorff’s α is a generalization of other interrater reliability measures such as Fleiss’ κ and Cohen’s κ (Krippendorff 2011). The cutoff was α = 0.75, which eliminated 38 respondents (21%). This percentage is consistent with those reported in the literature on eliminating inattentive respondents (Oppenheimer et al. 2009, Hauser and Schwarz 2015). This literature establishes that such elimination procedures result in higher-quality data that are not biased by the elimination. Ratings (7,308) from consistent users were aggregated to a mean value for each of the 203 unique SUVs in model years 2010–14. We chose to focus on the side viewpoint in this work as a proof-of-concept. Respondents rated vehicles from the side viewpoint, and the same mean value for a vehicle was assigned to all viewpoints within ±20 azimuth degrees of the side view in the training data to increase sample size. We evaluate the prediction methods using only the single side viewpoint of labeled images.3
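For readers who wish to replicate the consistency screen, the sketch below computes Krippendorff's α with the open-source krippendorff package; treating a respondent's first and second passes over the repeated images as two "raters" is our assumption about how the matrix is formed, and the toy ratings are illustrative.

```python
import numpy as np
import krippendorff  # open-source package: pip install krippendorff

# For one respondent: ratings of the repeated images on their first and second appearance.
first_pass = [3, 4, 2, 5, 3, 4, 2, 1, 4, 3]
second_pass = [3, 4, 2, 4, 3, 4, 2, 2, 4, 3]

alpha = krippendorff.alpha(
    reliability_data=np.array([first_pass, second_pass], dtype=float),
    level_of_measurement="interval")
print(f"Krippendorff's alpha = {alpha:.2f}")  # respondents below the 0.75 cutoff are eliminated
```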

In our analysis, we randomly split the rated data into training, validation, and test data at a ratio of 50%:25%:25%. We used three random splits of the data to allow calibrated standard deviations of the predictive results. Each random split was stratified by considering the make and model of the vehicles. For example, for a given random split, all Jeep Wranglers were in either the training, validation, or test data. This stratified splitting avoids data leakage across year-make-model combinations of vehicles in the data.4
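A minimal sketch of such a grouped 50%:25%:25% split using scikit-learn's GroupShuffleSplit; the column names and toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "image_id": range(8),
    "make_model": ["jeep_wrangler", "jeep_wrangler", "bmw_x3", "bmw_x3",
                   "volvo_xc60", "volvo_xc60", "buick_enclave", "buick_enclave"],
    "mean_rating": [3.1, 3.0, 3.8, 3.9, 3.6, 3.7, 3.4, 3.3],
})

# 50% of make/model groups go to training; the rest is split evenly into validation and test.
gss = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=0)
train_idx, rest_idx = next(gss.split(df, groups=df["make_model"]))
rest = df.iloc[rest_idx]

gss2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=1)
val_rel, test_rel = next(gss2.split(rest, groups=rest["make_model"]))
train, val, test = df.iloc[train_idx], rest.iloc[val_rel], rest.iloc[test_rel]

# All images of a year-make-model (e.g., every Jeep Wrangler) land in exactly one split,
# preventing leakage across near-identical vehicles.
```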

6.2. Data: Unrated Full-Color Images

We obtained access to industry-standard high-quality images available from aggregators. These images are often used by automotive firms in their marketing communications. The typical “rental” cost is about $50,000 per month. We obtained 180,000 unlabeled images of 4,984 unique vehicles across several segments (e.g., sedans, trucks, SUVs). All images were rescaled to 512 × 512 pixel resolution. We used conventional computer vision tools to obtain masks (GrabCut), car color, and viewpoint. Unlabeled images included product attributes such as brand, body type, model, and model year. We describe the available attributes in the appendix, Section A.9.

The unlabeled images were randomly split by the same process as the labeled images. Unique vehicles held out in the validation and test sets of the labeled image data were also held out from the unlabeled image data, thereby ensuring the model never had access to these vehicle images during training. The vast majority of the unlabeled images remained in the training set, because the number of unique vehicles in the unlabeled images dwarfs the number of unique SUVs/CUVs in the labeled images.

7. Evaluation of the Machine Learning Augmentation

We evaluate the ability of the predictor to predict the aesthetic ratings of the held-out vehicles. We evaluate the generator on the face validity of the generated images, the ability to generate images with high aesthetic ratings, the ability to motivate descriptive insights for new designs, and the ability to generate images comparable to MY2020 vehicles that were introduced to the market five to six years after MY2010–14. Our data contain images with different viewpoints, so the generator can create new designs with different rotational angles. However, the aesthetic ratings are only available for greyscale images from the side viewpoint (Section 6.1). We train and evaluate the predictor using these data.

7.1. Predictive Ability

Figure 5 illustrates predictions for eight SUVs/CUVs5; we report the mean absolute error (MAE) for predicted versus actual ratings on random splits of the held-out data in Table 2. Our model yields an MAE of 0.350 scale points, an improvement of 43.5% over the naïve (uniform) baseline. To put the MAE of the proposed model in perspective, we compare its predictive ability to a series of benchmark models that vary from naïve to sophisticated. We used the same training and validation data to develop the benchmarks and optimized their hyperparameters to provide a meaningful comparison.

Figure 5. (Color online) Examples of Predictive Accuracy of Machine Learning Augmentation for Aesthetic Appeal Score
Table 2. Predictive Test of Machine Learning Augmentation vs. Baselines and Benchmarks

Prediction model | Mean absolute error (standard deviation) | Improvement
Baseline: Median rating in training images (constant rating) | 0.620 (0.043) | 0.0%
Benchmark: Computer vision features and random forest (conventional machine learning) | 0.446 (0.047) | 28.1%
Benchmark: VGG16 with fine-tuned final layers (pretrained deep learning) | 0.405 (0.039) | 34.7%
Proposed machine learning augmentation (custom deep learning) | 0.350 (0.043) | 43.5%

7.1.1. Uniform Baseline.

The most naïve baseline is that respondents select the scale midpoint. This baseline represents zero information. A less naïve baseline uses global information from the training samples to calculate the median rating. To be conservative, we use this less-naïve uniform baseline.

7.1.2. Sophisticated Benchmark 1: Random Forest and Computer Vision Features.

Computer vision and machine learning have a long history of processing high-dimensional image and video data for object detection and image segmentation. Conventional approaches reduce high-dimensional visual data to a small set of “hand-engineered” features, which are then input to machine learning methods such as support vector machines.

Our benchmark uses three types of hand-engineered features from computer vision. (1) Histograms of oriented gradients (HOG) features encode edge and shape information. HOG features divide the image into a grid of image patches, calculate the gradients of each patch, and bin these gradients into a histogram. Edge orientation and shape intensity are contained in the gradients’ direction and magnitude values. (2) A downscaled version of the image itself (e.g., 512 × 512 to 32 × 32). (3) Histograms of color values for each RGB image channel. These features are used in a random forest with 100 trees. We present a random forest because it performed best when tested against other common machine learning approaches: support vector machines, Gaussian process regression, and L1/L2-regularized linear regression.
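The benchmark can be sketched as follows with scikit-image and scikit-learn; HOG parameters, histogram bins, and variable names are illustrative assumptions rather than the exact benchmark configuration.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.ensemble import RandomForestRegressor

def handcrafted_features(image):
    """HOG features + downscaled pixels + per-channel color histograms for one RGB image in [0, 1]."""
    hog_feat = hog(rgb2gray(image), orientations=9,
                   pixels_per_cell=(32, 32), cells_per_block=(2, 2))
    downscaled = resize(image, (32, 32)).ravel()
    color_hist = np.concatenate(
        [np.histogram(image[..., c], bins=32, range=(0.0, 1.0))[0] for c in range(3)])
    return np.concatenate([hog_feat, downscaled, color_hist])

# images: list of (512, 512, 3) arrays; ratings: mean appeal scores (hypothetical variables)
# X = np.stack([handcrafted_features(img) for img in images])
# benchmark = RandomForestRegressor(n_estimators=100).fit(X, ratings)
```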

7.1.3. Sophisticated Benchmark 2: Pretrained Deep Learning Model.

Many researchers use “pretrained” open-source neural networks trained for one prediction task and repurposed for another prediction task. As a sophisticated benchmark, we used the pretrained VGG16 deep learning model trained on the ImageNet database (Simonyan and Zisserman 2014). This benchmark outperformed other common pretrained models (e.g., ResNet50, InceptionV3, YOLOv5) for our prediction task, a finding consistent with similar tasks in the machine learning literature (Zhang et al. 2018). The VGG16 model is a pyramid of sixteen stacked layers (13 convolutional and 3 fully connected) that sequentially reduce images in size until they are classified in the last layer.

The initial layers of VGG16 transform pixels to edges and lines found in visual images. For our benchmark, we maintain the initial “pretrained” layers and replace the last classification layer with two batch-normalized rectified-linear layers followed by a regression layer. This architecture was chosen using validation data and mirrors the predictive model in our proposed approach. We train the model in two steps. We first freeze the pretrained layers and train only the new layers, and then we “fine-tune” the entire neural network by training all layers. The two-step procedure improves the benchmark’s prediction.6
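A sketch of this benchmark in PyTorch/torchvision; the hidden-layer sizes and learning rates are illustrative assumptions, and the weight-loading API shown is that of recent torchvision releases.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained VGG16 backbone with a new regression head: two batch-normalized
# rectified-linear layers followed by a linear regression output.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier = nn.Sequential(
    nn.Linear(25088, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Linear(128, 1),                        # predicted aesthetic appeal rating
)

# Step 1: freeze the pretrained convolutional layers and train only the new head.
for p in model.features.parameters():
    p.requires_grad = False
head_optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)

# Step 2: unfreeze all layers and fine-tune the entire network at a lower learning rate.
for p in model.parameters():
    p.requires_grad = True
full_optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```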

7.1.4. Results.

Table 2 compares the predictive performance of our proposed machine learning approach to a naïve baseline (the median rating in the training data) and the sophisticated benchmarks for predicting the aesthetic appeal score. The proposed machine learning augmentation outperforms the other methods.7 Contextualizing this improvement is important. For some applications, particularly those that predict without generation, pretrained networks may be enough. For many firms, however, product design is a multimillion-dollar investment decision, and even a small improvement in precision is valuable, as is integration with generation. Our research partner judges that our model’s predictions are sufficiently accurate to provide a viable alternative for initial screening prior to formal theme clinics.

We evaluate how sensitive the predictive performance of the proposed model and benchmarks is to the amount of training data in our context. We train the models with random subsamples of the labeled data and random subsamples of the unlabeled data, and then report the out-of-sample MAE in the appendix, Sections A.11 and A.12. The number of labeled images is relatively sparse compared with the unlabeled images. As we reduce the size of the labeled data, the predictive performance of both the proposed model and the benchmark pretrained model deteriorates, yet the MAE of the proposed model with 10% of the labeled data remains below 0.5 points on the five-point scale. The unlabeled images are plentiful. The predictive performance remains similar to the full-data benchmarks down to 10% of the unlabeled data. We note that, although predictive performance is relatively insensitive to the number of unlabeled images at this scale, more images enhance the quality and ease of training of the generator (see the appendix, Section A.12).

7.2. Generative Capability

By its very nature, the quality of a generated image, and its usefulness to managers and designers, is subjective. Full evaluation is likely to take many years as machine learning augmentation becomes part of an ongoing design process, as new vehicles are launched to the market using the augmented design process, and as we observe market acceptance. An A/B experiment through to market launch is not feasible given organizational constraints and the billion-dollar costs of launching redesigned A versus B vehicles. The best we might hope for is a natural experiment in which one suborganization adopts the model and another does not (e.g., Griffin and Hauser (1992) for House-of-Quality adoption at an automotive manufacturer). Even then, only organizational judgments would be feasible. At this time, we triangulate the value of the generated images in four ways: face validity, consumer evaluations of generated designs, managerial judgment, and the ability to generate images that are close to innovative vehicles that were launched after the time during which the training data were obtained.

7.2.1. Face Validity: Controllably Generating Images.

Our first test is whether the proposed approach can create realistic images controllable by attributes (e.g., body type). We begin by sampling points in the embedding space, conditioned on desired attributes, and then move smoothly around that space. We use spherical linear interpolation to sample new points. For each point in the embedding space, we generate a high-dimensional image. The images are realistic and can be morphed in a controllable manner that mimics the way design teams evolve designs. So that the reader may judge, we provide examples in Figure 6 and demonstration videos of SUV/CUV morphing at https://vimeo.com/497011714/. We demonstrate controllability by morphing other body types at https://vimeo.com/334094197.
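Spherical linear interpolation between two embedding vectors can be implemented in a few lines. The sketch below is illustrative; `generator` and `sample_embedding` are hypothetical stand-ins for the trained decoder and the conditional sampling step described above.

```python
import numpy as np

def slerp(h0, h1, t):
    """Spherical linear interpolation between two embedding vectors.
    t in [0, 1]; t=0 returns h0, t=1 returns h1."""
    h0, h1 = np.asarray(h0, float), np.asarray(h1, float)
    # Angle between the two embeddings.
    cos_omega = np.dot(h0, h1) / (np.linalg.norm(h0) * np.linalg.norm(h1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):          # nearly collinear: fall back to lerp
        return (1.0 - t) * h0 + t * h1
    return (np.sin((1.0 - t) * omega) * h0 + np.sin(t * omega) * h1) / np.sin(omega)

# Morph between two sampled embeddings and decode each step into an image.
# `sample_embedding` and `generator` are not defined here; they stand in for
# the conditional sampler and trained decoder.
# h_a, h_b = sample_embedding(body_type="SUV"), sample_embedding(body_type="CUV")
# frames = [generator(slerp(h_a, h_b, t)) for t in np.linspace(0.0, 1.0, 30)]
```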

Figure 6. (Color online) Examples of Generated Designs
Notes. First row: SUV/CUV. Second row: Sedan to truck. Third row: Rotation.

7.2.2. Generating Appealing Images: Consumer Evaluations.

To test the ability of the model to produce aesthetically appealing images, we generated 50 targeted images: 25 were predicted by the predictor to be rated highly and 25 were predicted to be rated poorly. Figure 7 provides examples of generated images of each type. For consistency with the training data available to the predictor, we generated each image to be a light gray SUV/CUV from the side view. The generated designs were created by using spherical interpolation between existing designs to ensure plausibility to respondents and to mitigate biases (Lopez et al. 2019). Following Section 6.1, we used respondents from a professional Internet panel (ProdegeMR, at $4 per respondent) to evaluate the aesthetic appeal over randomly selected pairs of the generated designs. Following industry standards (also used by our industry partner), we screened respondents to be SUV/CUV “intenders.”
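A hedged sketch of how such targeted stimuli could be assembled is shown below: interpolate between existing embeddings (using a slerp helper like the one sketched in Section 7.2.1), score each candidate with the predictor, and keep the extremes. The `generator` and `predictor` callables and the candidate counts are illustrative assumptions, not our production pipeline.

```python
import numpy as np

def targeted_designs(embeddings, generator, predictor, n_candidates=2000,
                     n_per_group=25, seed=0):
    """Generate candidate designs by spherically interpolating between pairs of
    existing embeddings, score each with the predictor, and keep the
    n_per_group highest- and lowest-scoring designs.
    `generator`, `predictor`, and `slerp` (sketched in Section 7.2.1) are
    stand-ins for the trained model components."""
    rng = np.random.default_rng(seed)
    candidates, scores = [], []
    for _ in range(n_candidates):
        i, j = rng.choice(len(embeddings), size=2, replace=False)
        h = slerp(embeddings[i], embeddings[j], rng.uniform(0.2, 0.8))
        candidates.append(generator(h))
        scores.append(predictor(h))
    order = np.argsort(scores)
    low = [candidates[k] for k in order[:n_per_group]]
    high = [candidates[k] for k in order[-n_per_group:]]
    return high, low
```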

Figure 7. (Color online) Example of Generated Designs for Consumer Evaluation and Augmenting Managers

We pretested the survey carefully. The initial sample was 358 respondents. Following suggested practice and prior to any analysis of the data, we used instructional manipulation checks (IMCs) to eliminate 116 inattentive and/or “professional” respondents (Oppenheimer et al. 2009). In particular, respondents were eliminated if they were not SUV intenders, answered too quickly or too slowly, responded with “straight-line” patterns, or failed “trap” questions that tested for attention. Our elimination rate is typical of industry and academic experience—see review in Morren and Paas (2020). IMCs increase the reliability of survey data and encourage respondents to think hard (Oppenheimer et al. 2009, Hauser and Schwarz 2015). The final screen to 181 respondents eliminated an additional 61 respondents who were inconsistent in answering repeated binary choice questions. (We built the repeated questions into the survey before the survey was fielded.)
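For illustration, screening rules of this kind can be applied programmatically once responses are collected; the pandas sketch below uses hypothetical column names and thresholds, not our survey's actual variables.

```python
import pandas as pd

def screen_respondents(df, min_seconds=120, max_seconds=2400):
    """Apply instructional-manipulation-check style screens to survey data.
    Column names (intender, duration_sec, failed_trap, straightlined,
    inconsistent_repeats) and the duration thresholds are hypothetical."""
    keep = (
        df["intender"]                                          # SUV/CUV intenders only
        & df["duration_sec"].between(min_seconds, max_seconds)  # not too fast or too slow
        & ~df["failed_trap"]                                    # passed attention traps
        & ~df["straightlined"]                                  # no straight-line patterns
    )
    attentive = df[keep]
    # Final screen: drop respondents inconsistent on repeated binary choices.
    return attentive[~attentive["inconsistent_repeats"]]
```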

The consumer evaluations suggest that respondents judged as aesthetically appealing images that the predictor forecast to be aesthetically appealing and judged as aesthetically unappealing images that the predictor forecast to be aesthetically unappealing. Specifically, the predictor and consumers agreed 74.0% of the time.

7.2.3. Other Aesthetic Metrics: Innovativeness.

Firms are often interested in multiple aesthetic metrics, and our model can be easily recalibrated for different metrics. Our industry partner provided ratings of aesthetic "innovativeness" for the same 203 unique SUVs/CUVs in the labeled data that were rated for aesthetic "appeal." We recalibrated the model to predict aesthetic innovativeness by fine-tuning a previously trained model of aesthetic appeal using the aesthetic ratings for "innovativeness" (see the appendix, Section A.14, for details). We then sampled SUV/CUV designs from the embedding space. Figure 7 illustrates vehicles with predicted high and low aesthetic appeal and with predicted high and low aesthetic innovativeness.

Figure 7 highlights that aesthetically appealing and aesthetically innovative designs can take very different aesthetic forms. One cannot exhaustively describe the differences between the designs in Figure 7 with a small set of attributes. Generated images are instead intended to visually showcase these differences to guide management and spark creativity among designers.

7.2.4. Augmenting Managers: Can the Model Guide Design Exploration?

Design managers balance current consumer preferences with designers' creative visions of the future. Firms aim to strike a delicate balance between a vehicle's aesthetic appeal and its aesthetic innovativeness, a tension supported by the academic literature: too much aesthetic innovativeness and the product is overly avant-garde and unlikely to appeal to a large market; too little innovativeness and the product quickly becomes stale and loses competitive advantage (Landwehr et al. 2011, Toubia and Netzer 2017).

We showed the images in Figure 7 to senior managers at our automotive partner who were responsible for evaluating aesthetic design.8 These managers immediately recognized differences between the generated designs and attempted to identify design factors associated with aesthetic innovativeness for new vehicles (e.g., neutral “rake” with negatively sloping roof). The images inspired the managers to consider other design features for investigation such as the “front overhang” and “hood slope.” These managers stated further that deep generative models are cost-effective, inform and augment designer intuition, and may offset any (human) biases in the design generation and selection process. Although this evidence is anecdotal, the images seemed to be extremely valuable to guide exploration of the design space by experienced senior managers.

7.2.5. Anticipating Successful Designs.

One of the first questions practicing designers ask is whether the model can generate "creative" designs. Our model was trained only on data from MY2010–2014. Many new aesthetic designs have since been introduced to the market that are not in our training data. As a minimal test of the ability to produce known creative designs, we examine whether the generator could have produced images that are similar to since-introduced MY2020 vehicles.

Figure 8 compares four SUV/CUV designs from the MY2010–14–trained generator to four new SUV/CUV designs from MY2020. Although not identical, the generated images evoke the holistic aesthetics of the recently introduced vehicles. At a more detailed level, the comparative designs are similar on common measures of vehicle aesthetics such as proportion, surface, and detail (PSD). For example, the second column contains a generated design with a very high and positively angled "beltline" coupled with a dramatically downward-swooping "greenhouse," a design later introduced in the Mercedes GLE. Because new aesthetically appealing PSDs are of particular interest in designers' creative visions, it is encouraging that the generator discovers designs that have PSDs comparable to new production vehicles introduced successfully to the market six years after the time frame from which the training, validation, and test data were drawn.

Figure 8. (Color online) Examples of Generated Designs (MY2010-14 Data) and Actual Production Designs (MY2020)

8. Applications in Other Categories: Dining Room Chair Example

We engineered a machine learning augmentation that would apply generally across product categories. We chose the automobile industry for our initial proof-of-concept application because product aesthetics are particularly valuable to the automobile industry. Our partner provided us with a unique opportunity to understand organizational needs and access to the same proprietary data used routinely by human designers. In this section, we explore an additional application using publicly available images of dining room chairs. We collected aesthetic ratings from an online panel of respondents using the same procedure described in Section 7.2 and trained and tuned the model as described in Sections 4 and 5. For replication, we provide our codebase as open source.

8.1. Dining Room Chair Images

The images for our second application come from an open-source data set of chair images provided by Aubry et al. (2014). These images were created by rotating 3D computer-aided design (CAD) drawings of chairs and taking 2D image snapshots across 62 angular viewpoints for each chair. The market for chairs, like the automotive market, is segmented; predictions make the most sense within a segment. We chose the dining room chair segment that had 700 unique dining room chairs—one of the largest segments in the sample—for a total of 43,400 images (62 × 700) across all viewpoints. We preprocessed the images to grayscale and downscaled to a variety of resolutions (8 × 8, 16 × 16,…, 128 × 128) for progressive training. We reparametrized the image viewpoints to consistent angular coordinates.9
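A minimal preprocessing sketch is shown below, using Pillow to convert a CAD snapshot to grayscale and to write the resolution pyramid used for progressive training; the file paths and the resampling filter are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

RESOLUTIONS = [8, 16, 32, 64, 128]   # progressive-training pyramid

def preprocess_chair_image(path, out_dir):
    """Convert one CAD snapshot to grayscale and save it at each resolution
    used during progressive training. Paths and filter choice are illustrative."""
    img = Image.open(path).convert("L")          # grayscale
    for res in RESOLUTIONS:
        resized = img.resize((res, res), Image.LANCZOS)
        target = Path(out_dir) / f"{res}x{res}" / Path(path).name
        target.parent.mkdir(parents=True, exist_ok=True)
        resized.save(target)
```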

To implement the augmentation model consistently with the automotive application, we obtained aesthetic ratings (labels) for 200 of the 700 dining room chairs. Two hundred labeled images are comparable in number to the 203 unique labeled automobiles (SUVs/CUVs; Section 6.1). Based on small-sample qualitative research and an initial pilot test with 101 Amazon Mechanical Turk respondents, we selected a five-point semantic aesthetic scale from "Very Traditional to Very Modern" as the most descriptive, most consistent, and least ambiguous aesthetic dimension.10 For this aesthetic scale, the survey respondents provided consistent ratings for each chair design. The average rating varied across the chairs. We sourced 510 new respondents from the same professional panel used for the automotive data (ProdegeMR). Instructional manipulation checks filtered our sample to 348 attentive respondents.

8.2. Model Training and Predictive Test

We trained the proposed model for the new category up to the highest resolution consistent with the open-source images: 128 × 128. We used the same procedures as in the automotive application. For the predictive tests, we use an equivalent baseline and the same computer vision and pretrained deep learning models. Table 3 displays the predictive tests for the dining room chair images. The custom deep learning model outperforms both conventional machine learning and pretrained deep learning. Interestingly, conventional machine learning outperforms pretrained deep learning. Conventional machine learning, tuned to predictive ability, is almost as good as the custom deep learning model, which must both predict and generate. The lower quality of the unlabeled dining room chair images likely puts an upper bound on the predictive ability of any machine learning model, conventional or custom deep learning. At minimum, the predictive test for dining room chair images suggests that the custom deep learning model can predict well even when trained on lower-quality images.

Table 3. Predictive Test for Dining-Room Chairs

Prediction model | Mean absolute error (standard deviation) | Improvement
Baseline: Median rating in training images (constant rating) | 0.480 (0.011) | 0.0%
Benchmark: Computer vision features and random forest (conventional machine learning) | 0.430 (0.016) | 10.4%
Benchmark: VGG16 with fine-tuned final layers (pretrained deep learning) | 0.434 (0.036) | 9.6%
Proposed machine learning augmentation (custom deep learning) | 0.423 (0.013) | 11.9%

8.3. Illustration of the Generator

The dining room chair data set used in our application is widely accepted in computer science research focused on generative modeling. In the automotive proof-of-concept (Sections 6 and 7), we were fortunate to have access to senior managers and to images of successful SUVs launched after the time of data collection. We do not have an industry partner in the furniture category nor the high-quality proprietary images normally retained by retailers and manufacturers. Nonetheless, we evaluate whether we can generate images that are as dining room chair–like as the open-source CAD images and literature baselines.

The first row of Figure 9 is a sample of the open-source dining room chair CAD images. The second row illustrates chairs generated by our proposed model. Our generator sought to generate new aesthetic designs that were not among the open-source images (i.e., our goal is not reconstruction). The third row illustrates the generated images for three established baselines—deep convolutional inverse graphics network (DC-IGN; Kulkarni et al. 2015), information maximizing GAN (InfoGAN; Chen et al. 2016a), and a VAE (Kingma and Welling 2013). The baseline models were trained on the same open-source data that we used (Higgins et al. 2017); we provide two typical examples for each of the three models.

Figure 9. Example Open-Source and Generated Dining Room Chairs

Our model and the established baselines generate images that are dining room chair–like. They suggest ideas that a designer might pursue. Our model appears to generate images that are crisper than those generated by DC-IGN and the VAE and are at least as good as those generated by InfoGAN. This observation is in line with our modeling approach that combines the VAE architecture for controllability and adversarial components in training for realism. At minimum, the replication in dining room chairs suggests our model does as well as established models when trained on sparse CAD images.

The generated dining room chair images are not as smooth and complete as the automotive images, likely because of the lower quality of the dining-room-chair image data. Access to sufficiently large samples of high-quality images might be an important requirement for the industry adoption of the generative models for product aesthetic design. Fortunately, many firms routinely generate many high-quality-image aesthetic designs in the normal course of business (see Sections 2.1 and 8.4).

8.4. Summary of the Replicability Test

The dining room chair application demonstrates that our proposed integrated deep learning model is tractable outside of the automotive industry. We find the dining room chair results qualitatively mirror those from the automotive case study. Quantitative differences are likely due to the lower realism of dining room chair images (2D images from 3D CAD renderings). Conditional on the available data, the predictive and the generative abilities in the replication study are comparable to the state-of-the-art baselines.

The number and quality of unlabeled images likely affects predictive and generative performance. We believe many product categories are amenable to our work. For example, there are 10,000 home-furnishing items in the IKEA catalog. Dyson has several hundred sketches per new product across a large number of products. There are 700,000 SKUs in the product line of a large fashion retailer. Even within apparel segments, there should be sufficiently many high-quality unlabeled images to train the model well.

9. Conclusion

9.1. Discussion and Summary

Deep learning methods are beginning to affect all aspects of marketing science, sometimes with methods customized to the challenge, sometimes with tuned pretrained models. Many of these methods rely on hard-to-quantify unstructured data such as natural language or images. We focus on consumers’ aesthetic judgments of images by using state-of-the-art machine learning to augment human design decisions. The augmentation comes in two forms. First, we predict consumer evaluations of new potential aesthetic designs from images. Second, we controllably generate new images to enhance creative design.

Our machine learning approach combines many concepts with an overall goal of aligning with actual aesthetic design processes used at firms. We developed a version of the semi-supervised VAE model that uses a low-dimensional embedding to “bottleneck” information between a predictive and generative model. Within the VAE framework, we include attributes to carry information about images and include masks to constrain target images to be realistic. We add adversarial training concepts from the GAN literature to improve training of realistic generated images. Finally, we use a variety of engineering ideas (e.g., spectral normalization, progressive training, residual connections, adaptively balanced training losses) to tune the deep learning model so it stably converges during training.
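To illustrate a few of these engineering ideas, the PyTorch sketch below combines spectral normalization, a residual connection, and a simple adaptive weighting rule in the spirit of BEGAN (Berthelot et al. 2017). It is a generic sketch under assumed channel counts and hyperparameters, not the architecture or balancing scheme used in our model.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResidualBlock(nn.Module):
    """Residual convolutional block with spectral normalization and leaky
    rectified-linear activations; channel counts are illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3, padding=1)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3, padding=1)),
        )

    def forward(self, x):
        # Skip connection: gradients can bypass the nonlinear path.
        return x + self.body(x)

def update_adversarial_weight(k, loss_real, loss_fake, gamma=0.5, lr_k=1e-3):
    """One simple way to balance losses adaptively, in the spirit of BEGAN
    (Berthelot et al. 2017): adjust the weight k so that the adversarial loss
    on generated images tracks gamma times the loss on real images."""
    k = k + lr_k * (gamma * float(loss_real) - float(loss_fake))
    return min(max(k, 0.0), 1.0)

# Example: a spectrally normalized residual stack applied to a batch of
# 64-channel feature maps (batch of 4, spatial size 64 x 64).
block = nn.Sequential(ResidualBlock(64), ResidualBlock(64))
x = torch.randn(4, 64, 64, 64)
print(block(x).shape)   # torch.Size([4, 64, 64, 64])
```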

Our proposed augmentation is based on our understanding of how real organizations design product aesthetics. Our model recognizes the delicate and iterative interplay of machine learning and existing human workflows, a point stressed heavily in our working interactions with managers and designers. Our model is separable in use, enabling adoption by asynchronous and distributed design and testing teams. We focus on augmenting rather than automating human expertise and human creativity, and ensuring all models are meaningfully controllable by the respective teams within the firm.

Our model addresses the practical challenge of using a relatively limited amount of costly-to-obtain labeled images. If we were to train a deep learning model on labeled images alone, the embeddings and generative capability would be weak at best. We overcome the challenge with semi-supervised learning that combines expensive rated "thin data" with less expensive and significantly larger "big data." High-quality unlabeled images are not inexpensive and are often protected by copyright, but they are often available to the firm.

We demonstrate that the predictor predicts image ratings better, sometimes substantially better, than strong, tuned machine learning benchmarks such as conventional computer vision methods and pretrained deep learning models. We demonstrate (1) that the generator generates face-valid images, (2) that consumers evaluate as aesthetically appealing images created to be aesthetically appealing, (3) that generated images anticipate designs that were introduced to the marketplace five to six years later, (4) that the model can be tuned to process other aesthetic scales, and (5) that the model can be applied to nonautomotive product categories (Section 8). Anecdotally, the automotive generator is viewed by senior design and marketing managers as valuable and worth further investment.

9.2. Limitations and Further Research

The SUV/CUV application is a proof-of-concept developed to augment aesthetic design teams. New vehicles take many years and $1–3 billion in investment (Blonigen et al. 2013). Over time, we will learn whether machine learning augmentation has documented monetary benefits beyond the qualitative benefits illustrated in this paper. Directly assessing the financial value of aesthetics is challenging given the interrelatedness of new aesthetic designs with confounding factors such as functional attributes, brand identity, marketing, pricing, and aesthetic trends (Person et al. 2016). Recent promising work on disentangling these factors includes approaches that explicitly control covariation in functional and form attributes (Higgins et al. 2018, Kang et al. 2019, Zhang et al. 2019) and approaches that temporally model aesthetic trends (Yoganarasimhan 2017). For now, our work relies on predictive statistics, generative illustrations, and managerial judgment for validation.

9.2.1. Image Quality.

Professional-level images enhance quality by controlling for variables that inadvertently affect aesthetic perceptions. Such variables include lens and f-stop choices (e.g., fisheye, telephoto), zoom level, azimuthal capture angle, chromatic aberrations, lighting (saturation and hue), day versus night, masking controls, background images, occlusions in foreground images, and visual noise.

Higher resolutions help the model identify aesthetic dimensions not available at lower resolutions and account for how those dimensions combine holistically. This was evidenced in the automotive application when training the model at lower resolutions during progressive training, as well as in the chair application, in which the data were of limited resolution and realism (see Section 8.4). Given these benefits, it is not surprising that human designers prefer to work with higher-quality (higher resolution) images. However, tuning the machine learning models for higher resolutions requires more training data, as the models need to encode and generate more complex aesthetics. Future research could enhance quality by developing approaches for modeling and evaluating holistic 3D designs (Wu et al. 2016).

9.2.2. Data Needs.

Product categories vary in how challenging they are to model. For example, the aesthetics for dining-room chairs are likely simpler to model than human faces, whereas human faces are likely simpler to model than automobiles (Karras et al. 2017). Although our experience with two proof-of-concept categories suggests it might be sufficient to have unlabeled data in the tens of thousands and labeled data in the hundreds, more applications will pin down data needs. Of course, the more data the better, and advances in machine learning research can further reduce the data needs.

9.2.3. Scale.

Although our model was initially engineered to be effective for automotive vehicles, the principles and modeling decisions generalize to other marketing/aesthetic design applications. We used 180,000 unlabeled images in training—a typical scale for automotive aesthetic design applications. If the model were to be scaled to millions or billions of images, we would likely need to rely on distributed computing and different neural network architectures.

9.2.4. Technical Issues.

Many of the technical challenges came from combining VAE and GAN concepts. When “realistic” generation alone is the primary goal, pure GANs often outperform VAEs and flow-based approaches (Kang et al. 2017, Pan et al. 2017, Sbai et al. 2019). However, GANs often lack the latent-space embedding structure needed for predictive modeling and controllable “creative” generation. Further development of deep generative modeling frameworks, improved methods for systematic model tuning, and further experience in other applications and product categories will simplify the calibration of augmentation models (see Section 5.2).

9.2.5. Further Work with Designers.

A natural next step is to assess the degree to which the proposed approach augments designer creativity. Although perhaps less straightforward to measure, similar questions have seen recent marketing interest in applications such as idea generation (Toubia and Netzer 2017) and branding and logo generation (Dew et al. 2022). Machine learning methods that promote “diversity” of generated designs, for example, augment designer creativity via larger search of the space of designs (Nobari et al. 2021). Likewise, methods such as “disentangled” representation learning may offer designers and managers opportunities to identify new aesthetic attributes (Higgins et al. 2018). As we continue to be guided by real design needs and managerial problems, ongoing advances in machine learning bode well for a future of augmenting human intelligence with machine intelligence.

Acknowledgments

The authors thank Jeff Hartley, John Manoogian II, Andrew Norton, Joyce Salisbury, Zheng Shen, and Sharon Sheremet for valuable insights into how product aesthetics are designed and evaluated; Mark Beltramo, Remi Daviet, Fred Feinberg, Ari Helljaka, Honglak Lee, Ye Liu, and Yanxin Pan for mathematical modeling discussion; and Emrah Bayrak, Songting Dong, Dean Eckles, Nasreddine El Dehaibi, Ryan Dew, Siham El Kihal, Gui Liberali, Erin MacDonald, Ye Liu, Max Yi Ren, and Glen Urban for helpful comments and suggestions. A. Burnap received support from General Motors to partially fund a postdoctoral research position for the research conducted in this work. He certifies that none of the research or its results were censored or obfuscated in its publication. J. Hauser and A. Timoshenko certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.

Appendix A

A.1. Summary of Notation

Table A.1. Summary of Notation

Notation | Description
X_i | Product image i; contains D pixels x_{id}
M_i | Product mask; contains D binary values m_{id}
y_i | Aesthetic rating
a_i | Product attributes vector (if known)
X̂_i | Product image (generated); contains D pixels x̂_{id}
M̂_i | Product mask (generated); contains D binary values m̂_{id}
ŷ_i | Aesthetic rating (predicted)
â_i | Product attributes vector (predicted)
μ_i, σ_i | Product embedding distribution parameters, q_enc(h_i | X_i, a_i)
h_i | Product embedding vector (K-dimensional)
β_E, β_P, β_G | Estimated parameters of the encoder, predictor, and generator

A.2. Probabilistic Formulation and Loss Function Separability

To ease notation, we temporarily write all parameters and likelihoods for a single datum i. We seek a joint distribution, p(yi,Xi|ai,β), for the ratings and images conditioned on the design attributes ai and the parameters β. The joint distribution can be decomposed into a predictive model and a generative model by the laws of conditional probability:

p(y_i, X_i \mid a_i, \beta) = p_{\text{pred}}(y_i \mid X_i, a_i, \beta)\, p_{\text{gen}}(X_i \mid a_i, \beta). \quad (A.1)

Representing and estimating the two conditional distributions is not feasible when product images are high dimensional. To address high dimensionality, we approximate the true joint distribution using embeddings. In our case, the embedding compresses information from high-dimensional images, aesthetic ratings, and product attributes to enable tractable predictive and generative models.

We estimate an embedding posterior distribution, q_enc(h_i | X_i, a_i, β), for each product design i such that h_i has substantially fewer dimensions (e.g., 512) than X_i (e.g., 786,432), yet retains most of the information contained in the images, ratings, and product attributes. To infer embeddings, we use variational Bayes methods to approximate the true joint log-likelihood, log p(y_i, X_i | a_i, β), with an approximate log-likelihood, ℓ^i_approx(β), that depends on the embeddings h_i (Jordan et al. 1999, Blei et al. 2017).

To obtain this approximation, we first condition the true log-likelihood, log p(y_i, X_i | a_i, β), on the embeddings via marginalization, which leads to the logarithm of an expectation. We then approximate the logarithm of the expectation by the expectation of the logarithm. By Jensen's inequality, the approximation is a lower bound to the true log-likelihood. Rearranging terms, we arrive at the approximate likelihood in Equation (A.2). Given an image, X_i, its rating, y_i, and its attributes, a_i, we seek to maximize ℓ^i_approx(β) and thus approximately maximize log p(y_i, X_i | a_i, β). If D_KL(·‖·) signifies the Kullback-Leibler (KL) divergence, then

\ell^{i}_{\text{approx}}(\beta) = E_{h_i}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta) + \log p_{\text{gen}}(X_i \mid h_i, \beta) - \log q_{\text{enc}}(h_i \mid X_i, a_i, \beta) + \log p_{\text{prior}}(h_i \mid a_i)\right]
= E_{h_i}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta) + \log p_{\text{gen}}(X_i \mid h_i, \beta)\right] - D_{\text{KL}}\!\left(q_{\text{enc}}(h_i \mid X_i, a_i, \beta) \,\|\, p_{\text{prior}}(h_i \mid a_i)\right). \quad (A.2)

Equation (A.2) is intuitive. The first term seeks to maximize our ability to predict aesthetic ratings based on the embeddings. The second term seeks to reproduce images based on the embeddings. The last term is the negative of the KL divergence from the embedding distribution to the prior on the embeddings. For this derivation, we have assumed conditional independence between images, ratings, and attributes given the embedding.

Because the approximate log-likelihood, ℓ^i_approx(β), is a lower bound to the true log-likelihood, log p(y_i, X_i | a_i, β), the approximation is commonly called the "evidence lower bound" (ELBO; Jordan et al. 1999). We derive Equation (A.2) for every datum i, which leads to an overall full-data approximate log-likelihood, L(β), which we maximize over the parameters, β, using the observed images and ratings:

L(\beta) = \sum_i \ell^{i}_{\text{approx}}(\beta). \quad (A.3)

We use conditional probability to separate Equation (A.2) into three component models:

L(\beta) = L_{\text{pred}}(\beta_P) + L_{\text{gen}}(\beta_G) + L_{\text{enc}}(\beta_E),
L_{\text{pred}}(\beta_P) = \sum_{i \in \text{rated}} E_{h_i}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta_P)\right],
L_{\text{gen}}(\beta_G) = \sum_{i \in \text{rated, unrated}} E_{h_i}\!\left[\log p_{\text{gen}}(X_i \mid h_i, \beta_G)\right],
L_{\text{enc}}(\beta_E) = -\sum_{i \in \text{rated, unrated}} \left\{ D_{\text{KL}}\!\left(q_{\text{enc}}(h_i \mid X_i, a_i, \beta_E) \,\|\, p_{\text{prior}}(h_i \mid a_i)\right) + D_{\text{KL}}\!\left(q_{\text{attr}}(\pi_i \mid X_i, \beta_E) \,\|\, p_{\text{attr}}(\pi_i \mid a_i)\right) \right\}. \quad (A.4)

In our empirical application, for (relatively) less expensive unlabeled data, we have access to product attributes (e.g., color, brand). Accordingly, the encoder log-likelihood, L_enc(β_E), differs from that in Equation (A.2) because we also learn the relationship between images and their attributes. Thus, in addition to the embeddings, h_i, we use variational inference to encode attribute information with π_i. Following the same reasoning used to derive Equation (A.2), we obtain the last KL divergence term in the encoder log-likelihood. This term acts as a classifier for attributes, a_i, when the attributes are known. We use the classifier to predict attributes when they are unknown or ambiguous (e.g., when generating new designs). See also Keng (2017) and Jang et al. (2016).

We choose probability distributions for the predictive, generative, and encoding models in Equation (A.4) using the VAE framework (Kingma and Welling 2013). Under the proposed distributional assumptions, the log-likelihood formulation is equivalent to the loss-function formulation labeled as Equation (1).

A.2.1. Predictive Model.

For the predictive model, we choose Laplace distributions with unit diversity centered at the predicted ratings, p_pred(y_i | h_i, β_P) = ½ exp(−|y_i − f_P(h_i, β_P)|). The log-likelihood of the Laplace distribution reduces to the (negative) L1 norm, thus enabling a probabilistic interpretation for the absolute loss in Equation (A.5). This predictive distribution implies that we minimize the mean absolute error of predicted versus true ratings, where ŷ_i = f_P(h_i, β_P):

L_{\text{pred}}(\beta_P) = \sum_{i \in \text{rated}} E_{h_i}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta_P)\right] = -\sum_{i \in \text{rated}} \left| y_i - \hat{y}_i \right|. \quad (A.5)

A.2.2. Generative Model.

We choose the generative model to be a high-dimensional Laplace distribution with unit diversity centered at the generated image, p_gen(X_i | h_i, β_G) ∝ exp(−|X_i − f_G(h_i, β_G)|). Similarly, we assume a high-dimensional Laplace distribution for the masks M_i. As with the predictive model, the Laplace distribution corresponds naturally to the L1 norm, enabling a probabilistic interpretation for the absolute loss. This implies the following log-likelihood function for the generative model:

L_{\text{gen}}(\beta_G) = \sum_{i \in \text{rated, unrated}} E_{h_i}\!\left[\log p_{\text{gen}}(X_i, M_i \mid h_i, \beta_G)\right] = -\sum_{i \in \text{rated, unrated}} \left\{ \frac{1}{3D} \sum_{d} \left| x_{id} - \hat{x}_{id} \right| + \frac{1}{D} \sum_{d} \left| m_{id} - \hat{m}_{id} \right| \right\}. \quad (A.6)

A.2.3. Encoding Model.

For the embedding variational family, we choose multivariate Gaussian mixture distributions with mixture components depending on product attributes, a_i (e.g., SUV). The embedding, h_i, has a Gaussian mixture marginal distribution, but h_i | a_i has a single Gaussian conditional distribution given attributes (Dilokthanakul et al. 2016). This expands the representational capacity of our model (Ranganath et al. 2015) without resorting to more complicated autoregressive and flow-based methods (Chen et al. 2016b).

We further assume each K-dimensional Gaussian has diagonal covariance, thereby factorizing into K conditionally independent Gaussians, where K is the dimensionality of the embeddings. If k indexes the elements of the embedding, then the variational assumption implies that q_{\text{enc}}(h_i \mid X_i, a_i, \beta_E) \propto \prod_{k=1}^{K} \sigma_{ik}^{-1} \exp\!\left(-\tfrac{(h_{ik} - \mu_{ik})^2}{2\sigma_{ik}^{2}}\right), where μ_i and σ_i are functions of X_i, a_i, and β_E. Following Kingma and Welling (2013), we obtain a simpler closed-form representation of D_KL(N(μ_ik, σ_ik) ‖ N(0, 1)).
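For reference, the standard closed-form expression for this divergence (Kingma and Welling 2013), which appears in Equation (A.7) below, is:

```latex
% KL divergence between the k-th factorized Gaussian of the encoder
% and its standard normal prior (closed form; Kingma and Welling 2013).
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_{ik}, \sigma_{ik}^{2}) \,\middle\|\, \mathcal{N}(0, 1)\right)
  = \tfrac{1}{2}\left(\mu_{ik}^{2} + \sigma_{ik}^{2} - 1 - \log \sigma_{ik}^{2}\right).
```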

For the second divergence term in the encoder log-likelihood, we show below that we may approximate D_KL(q_attr(π_i | X_i, β_E) ‖ p_attr(π_i | a_i)) ≈ constant − log q_attr(a_i | X_i, β_E). This results in the encoder acting as a multinomial classifier, q_attr(a_i | X_i, β_E), that predicts attributes from product images. Specifically, we have C multinomial distributions, where C is the number of attributes (e.g., brand, body type) and ℓ_c is the number of levels of attribute c.

The encoder neural net for μ_i and σ_i thus also produces π_i, from which we draw Dirichlet probabilities, â_i = q_enc(π_i | X_i, β_E), using a soft-max function. We recognize E_{π_i}[log q_enc(π_i | X_i, β_E)] as the cross-entropy for a draw of the attributes, a_i, from the multinomial probabilities, â_i. This provides the second term in the loss function. During training, this term encourages the encoder to learn attributes, whereas during prediction (when we do not know attributes), it allows us to estimate unknown product attributes, â_i, by sampling from the multinomial distribution indexed by π_i. Putting both terms together, we obtain:

L_{\text{enc}}(\beta_E) = -\sum_{i \in \text{rated, unrated}} \left\{ \sum_{k=1}^{K} \frac{1}{2}\left[\mu_{ik}^{2} + \sigma_{ik}^{2} - 1 - \log \sigma_{ik}^{2}\right] - \sum_{c=1}^{C} \sum_{\ell=1}^{\ell_c} a_{ic\ell} \log \hat{a}_{ic\ell} \right\}. \quad (A.7)
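To make the three components concrete, the following Python (PyTorch) sketch computes them for a minibatch under the distributional assumptions above. Tensor names and shapes are illustrative assumptions, not the implementation used in the paper, and we restrict attention to a single categorical attribute for simplicity.

```python
import torch
import torch.nn.functional as F

def component_losses(y, y_hat, x, x_hat, m, m_hat, mu, log_sigma, a, a_logits):
    """Minimal sketch of the loss components implied by Equations (A.5)-(A.7),
    written as quantities to minimize (negative log-likelihoods up to constants).
    Shapes (illustrative): y, y_hat: (B,); x, x_hat: (B, 3, H, W);
    m, m_hat: (B, 1, H, W); mu, log_sigma: (B, K); a: (B,) integer levels of one
    categorical attribute; a_logits: (B, n_levels)."""
    # Predictive term: absolute error on aesthetic ratings (Laplace likelihood).
    loss_pred = torch.abs(y - y_hat).sum()

    # Generative term: L1 reconstruction of images (normalized by 3D pixel values)
    # and masks (normalized by D pixel values), as in Equation (A.6).
    loss_gen = (torch.abs(x - x_hat).flatten(1).sum(1) / x[0].numel()
                + torch.abs(m - m_hat).flatten(1).sum(1) / m[0].numel()).sum()

    # Encoding term: closed-form Gaussian KL to the N(0, I) prior plus the
    # attribute cross-entropy from Equation (A.7).
    kl = 0.5 * (mu.pow(2) + torch.exp(2 * log_sigma) - 1 - 2 * log_sigma).sum()
    ce = F.cross_entropy(a_logits, a, reduction="sum")
    return loss_pred, loss_gen, kl + ce
```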

A.2.4. Derivation of Approximate Log-Likelihood.

We seek low-dimensional latent embeddings h_i. We marginalize the joint density over h_i and expand it into the predictive model, the generative model, and a prior over the product embedding, as in Equation (A.8).

\log p(y_i, X_i \mid a_i, \beta) = \log \int p(y_i, X_i, h_i \mid a_i, \beta)\, dh_i = \log \int p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, dh_i. \quad (A.8)

We seek to learn an embedding distribution rather than just a point estimate of h_i. Rather than explicitly assuming the form of the posterior over the product embedding, h_i, we introduce a tractable distribution, q_enc(h_i | X_i, a_i, β), with which to approximate it, resulting in the "encoder model."

\log \int p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, dh_i
= \log \int p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, \frac{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\, dh_i
= \log E_{h_i \sim q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\!\left[\frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\right]. \quad (A.9)

We find the best encoding model, q_enc(h_i | X_i, a_i, β), for each datum i from an assumed family of tractable densities. We estimate hyperparameters of the latent product embedding, h_i, which index a unique element within the assumed variational distribution family.

Estimating these parameters using sampling techniques (e.g., MCMC) is typically intractable (Daviet 2018); hence, we cast sampling as an optimization problem using a lower bound of the expectation via Jensen's inequality. This approximation is known as the "evidence lower bound," which is less than or equal to the intractable high-dimensional joint density, log p(y_i, X_i | a_i, β).

\log E_{h_i \sim q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\!\left[\frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\right] \geq E_{h_i \sim q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\!\left[\log \frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\right]. \quad (A.10)

With the logarithm moved inside the expectation, we decompose the joint density into three separate terms: the predictive model, the generative model, and the ratio of the encoder to the prior. Under the expectation with respect to the encoding model, this last term is the Kullback-Leibler divergence between the encoder and the prior over the embedding, D_KL(q_enc ‖ p_prior).

E_{h_i \sim q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\!\left[\log \frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}\right]
= E_{h_i \sim q_{\text{enc}}}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta)\right] + E_{h_i \sim q_{\text{enc}}}\!\left[\log p_{\text{gen}}(X_i \mid h_i, \beta)\right] - E_{h_i \sim q_{\text{enc}}}\!\left[\log \frac{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}{p_{\text{prior}}(h_i \mid a_i, \beta)}\right]
= E_{h_i \sim q_{\text{enc}}}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta)\right] + E_{h_i \sim q_{\text{enc}}}\!\left[\log p_{\text{gen}}(X_i \mid h_i, \beta)\right] - D_{\text{KL}}\!\left(q_{\text{enc}}(h_i \mid X_i, a_i, \beta) \,\|\, p_{\text{prior}}(h_i \mid a_i, \beta)\right) = \ell^{i}_{\text{approx}}(\beta). \quad (A.11)

These three terms comprise the approximation, ℓ^i_approx(β), that we maximize. Because the Kullback-Leibler divergence enters with a negative sign, maximizing the overall approximation includes minimizing the distributional dissimilarity between the posterior of the embedding given by the encoding model, q_enc(h_i | X_i, a_i, β), and the prior that we choose, p_prior(h_i | a_i, β). If this divergence were minimized to zero, the approximate likelihood would equal the true likelihood, that is, log p(y_i, X_i | a_i, β) = ℓ^i_approx(β). Thus, maximizing ℓ^i_approx(β) maximizes a lower bound on the previously intractable log-likelihood, log p(y_i, X_i | a_i, β).

A.2.5. Attributes.

Adding the latent variables for the parameters of the multinomial attribute classifier, π_i, results in a double integral and a corresponding expectation over the joint density of both h_i and π_i. Our assumption that the latent terms factorize splits the divergence into two KL-divergence terms in the last line of the derivation. See Keng (2017) for additional discussion of the relation between KL-divergence and the cross-entropy loss term.

\log p(y_i, X_i \mid a_i, \beta) = \log \iint p(y_i, X_i, h_i, \pi_i \mid a_i, \beta)\, dh_i\, d\pi_i
= \log \iint p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, p_{\text{attr}}(\pi_i \mid a_i, h_i, \beta)\, dh_i\, d\pi_i
= \log \iint p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, p_{\text{attr}}(\pi_i \mid a_i, h_i, \beta)\, \frac{q(h_i, \pi_i \mid a_i, X_i, \beta)}{q(h_i, \pi_i \mid a_i, X_i, \beta)}\, dh_i\, d\pi_i
= \log E_{h_i, \pi_i \sim q(h_i, \pi_i \mid a_i, X_i, \beta)}\!\left[\frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, p_{\text{attr}}(\pi_i \mid a_i, h_i, \beta)}{q(h_i, \pi_i \mid a_i, X_i, \beta)}\right]
\geq E_{h_i, \pi_i \sim q(h_i, \pi_i \mid a_i, X_i, \beta)}\!\left[\log \frac{p_{\text{pred}}(y_i \mid h_i, \beta)\, p_{\text{gen}}(X_i \mid h_i, \beta)\, p_{\text{prior}}(h_i \mid a_i, \beta)\, p_{\text{attr}}(\pi_i \mid a_i, h_i, \beta)}{q(h_i, \pi_i \mid a_i, X_i, \beta)}\right]
= E_{h_i, \pi_i \sim q}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta) + \log p_{\text{gen}}(X_i \mid h_i, \beta) + \log \frac{p_{\text{prior}}(h_i \mid a_i, \beta)}{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)} + \log \frac{p_{\text{attr}}(\pi_i \mid a_i, \beta)}{q_{\text{attr}}(\pi_i \mid X_i, \beta)}\right]
= E_{\pi_i \sim q(\pi_i \mid X_i, \beta)}\!\left[E_{h_i \sim q(h_i \mid a_i, X_i, \beta)}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta) + \log p_{\text{gen}}(X_i \mid h_i, \beta) - \log \frac{q_{\text{enc}}(h_i \mid X_i, a_i, \beta)}{p_{\text{prior}}(h_i \mid a_i, \beta)} - \log \frac{q_{\text{attr}}(\pi_i \mid X_i, \beta)}{p_{\text{attr}}(\pi_i \mid a_i, \beta)}\right]\right]
= E_{h_i \sim q(h_i \mid a_i, X_i, \beta)}\!\left[\log p_{\text{pred}}(y_i \mid h_i, \beta)\right] + E_{h_i \sim q(h_i \mid a_i, X_i, \beta)}\!\left[\log p_{\text{gen}}(X_i \mid h_i, \beta)\right] - D_{\text{KL}}\!\left(q(h_i \mid a_i, X_i, \beta) \,\|\, p_{\text{prior}}(h_i \mid a_i, \beta)\right) - D_{\text{KL}}\!\left(q_{\text{attr}}(\pi_i \mid X_i, \beta) \,\|\, p_{\text{attr}}(\pi_i \mid a_i, \beta)\right). \quad (A.12)

A.3. Gradient Backpropagation Using Local Reparameterization

We train the model by minimizing the loss functions in Table 1 with first-order stochastic gradient methods using mini-batches of training data. Specifically, we use the Adam stochastic gradient optimizer (Kingma and Ba 2015). Stochastic gradient methods are justified given their empirical performance and scalability via the backpropagation algorithm. Backpropagation simplifies an otherwise large calculation of a multi-parameter gradient to an equivalent series of smaller iterative gradient calculations. Gradients for a given loss in Table 1 propagate from the layer calculating the loss backward to “earlier” layers, thereby taking advantage of the compositional structure of network layers and the chain rule of differentiation.

To use gradient methods, we apply the "reparameterization trick" introduced in Kingma and Welling (2013) and further popularized by the success of the variational autoencoder (VAE). We rewrite the otherwise intractable gradient of an expectation over the embedding in an equivalent tractable formulation by splitting the stochastic Gaussian embedding distribution into a deterministic neural net and an independent additive stochastic term. With this simplification, it is feasible to compute an unbiased estimate of the gradient using Monte Carlo samples of the independent additive term. We similarly use this reparameterization trick when we do not have access to product attributes during training and inference. In this case, we use a relaxation of the otherwise nondifferentiable categorical attribute variables called the Gumbel-Softmax (Jang et al. 2016).
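The following Python (PyTorch) sketch illustrates both tricks. It is a minimal illustration, not the paper's implementation; the dimensions (a 512-dimensional embedding and 11 body-type levels, as in Table A.2) are used only as examples.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_sigma):
    """Reparameterization trick (Kingma and Welling 2013): draw
    h = mu + sigma * eps with eps ~ N(0, I), so gradients flow through
    mu and log_sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(log_sigma) * eps

def gumbel_softmax_sample(logits, temperature=1.0):
    """Gumbel-Softmax relaxation (Jang et al. 2016) of a categorical draw over
    attribute levels; differentiable for small positive temperatures."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / temperature, dim=-1)

# Example: sample a 512-dimensional embedding and a relaxed body-type attribute.
mu, log_sigma = torch.zeros(1, 512), torch.zeros(1, 512)
h = reparameterize(mu, log_sigma)
body_type = gumbel_softmax_sample(torch.zeros(1, 11))   # 11 body types (Table A.2)
```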

A.4. Example Masks for Compact Utility Vehicles (CUV)

Figure A.1. (Color online) Example Masks for Compact Utility Vehicles (CUV)

A.5. Examples of Images at Successive Stages of Progressive Training

Figure A.2. (Color online) Examples of Images at Successive Stages of Progressive Training

A.6. Example Rating Page from Aesthetic Rating Survey used in Theme Clinic

Figure A.3. (Color online) Example Rating Page from Aesthetic Rating Survey used in Theme Clinic

A.7. Examples of Training Instabilities: (a) Gradient Explosion and (b) Posterior Collapse

Figure A.4. (Color online) Examples of Training Instabilities: (a) Gradient Explosion and (b) Posterior Collapse

A.8. Embedding Dimensionality

We found that a 512-dimensional embedding was best able to encode information from images while still keeping the size of the entire model manageable to train. Underparameterization leads to insufficient model capacity, whereas overparameterization leads to excessively long training times. The following images provide examples of under- and overparameterization.

Figure A.5. (Color online) Embedding Dimensionality

A.9. Product Attributes Available for the Unlabeled Images

Table A.2. Product Attributes Available for the Unlabeled Images

Name | Description
Year | Model year of the vehicle (e.g., MY2014). Also known as "vintage."
Brand | One of 48 possible brands in the data. Each brand is often a subset of an overall firm (e.g., Cadillac is a brand of General Motors).
Model | Indicator of the vehicle within the brand's overall lineup (e.g., the Audi A3 is the smallest sedan offered by Audi in the U.S. market).
Viewpoint | Azimuth angle from which the image of the vehicle was taken.
Body Type | Vehicle exterior categorization into one of: Convertible, Coupe, Sedan, Hatchback, Wagon, CUV, SUV, Truck, Minivan, Passenger Van, Cargo Van.
Color | RGB coding of the primary exterior color of the vehicle.

A.10. Deep Pretrained Model with and Without Product Attributes

Table A.3. Deep Pretrained Model with and Without Product Attributes

Prediction model | Mean absolute error (standard deviation)
Benchmark: Pretrained VGG16 and fine-tuned final layers without product attributes | 0.405 (0.039)
Benchmark: Pretrained VGG16 and fine-tuned final layers with product attributes | 0.411 (0.050)

A.11. Effect of Less Labeled Data (Rated Images) for Semi-supervised Prediction: Comparison with YOLO

Table A.4. Effect of Less Labeled Data (Rated Images) for Semi-supervised Prediction: Comparison with YOLO

Prediction model | Mean absolute error (standard deviation)
Benchmark: Pretrained VGG16 and fine-tuned final layers: 100% labeled data | 0.405 (0.039)
Benchmark: Pretrained VGG16 and fine-tuned final layers: 50% labeled data | 0.452 (0.078)
Proposed machine learning augmentation (custom deep learning): 100% labeled data | 0.350 (0.043)
Proposed machine learning augmentation (custom deep learning): 50% labeled data | 0.404 (0.025)
Proposed machine learning augmentation (custom deep learning): 25% labeled data | 0.463 (0.045)
Proposed machine learning augmentation (custom deep learning): 10% labeled data | 0.551 (0.104)
You Only Look Once (YOLOv5) | 0.422 (0.016)

A.12. Effect of Less Unlabeled Data (Unrated Images) for Semi-supervised Prediction

Table A.5. Effect of Less Unlabeled Data (Unrated Images) for Semi-supervised Prediction

Prediction model | Mean absolute error (standard deviation)
Proposed machine learning augmentation (custom deep learning): 100% unlabeled data | 0.350 (0.043)
Proposed machine learning augmentation (custom deep learning): 50% unlabeled data | 0.382 (0.041)
Proposed machine learning augmentation (custom deep learning): 10% unlabeled data | 0.381 (0.041)

A.13. Effect of Restricting Model Training to Single Viewpoint (Sideview)

Table A.6. Effect of Restricting Model Training to Single Viewpoint (Sideview)

Prediction model | Mean absolute error (standard deviation)
Proposed machine learning augmentation (custom deep learning): All viewpoints | 0.350 (0.043)
Proposed machine learning augmentation (custom deep learning): Single viewpoint | 0.386 (0.036)

A.14. Predictive Test for Aesthetic Innovativeness

We recalibrated the proposed model to predict aesthetic “innovativeness” using a previously trained (proposed) model originally trained for aesthetic “appeal.” We followed the same procedure as the pretrained VGG model (see Section 7.1) by replacing the predictor for “appeal” with new (untrained) neural network layers to now predict “innovativeness.” Likewise, we initially trained only the new layers while freezing the rest of the previously trained model to allow the new layers’ parameters to stabilize, followed by “fine-tuning” training of the entire model. The resulting predictive results for “innovativeness” are given in Table A.7.

Table A.7. Predictive Test for Aesthetic Innovativeness

Prediction model | Mean absolute error (standard deviation) | Improvement
Baseline: Median rating in training images (constant rating) | 0.627 (0.069) | 0.0%
Benchmark: Computer vision features and random forest (conventional machine learning) | 0.496 (0.064) | 20.9%
Benchmark: VGG16 with fine-tuned final layers (pretrained deep learning) | 0.311 (0.032) | 50.4%
Proposed machine learning augmentation (custom deep learning), fine-tuned final layers | 0.253 (0.063) | 59.6%

A.15. Brief Definitions of Machine Learning Terms

2D Average Pooling.

Partitions the input data (e.g., activations from previous neural net layer) over two spatial dimensions and computes the average value within each of the subsets.

2D Convolution.

Convolution is performed along two spatial dimensions of the input data. Convolution multiplies and accumulates from overlapping samples of the input data using learned kernels.

Adaptively Balanced Training Losses.

The loss function used for model training consists of several weighted terms whose weights change dynamically in reaction to training stability metrics.

Annealed from Zero.

Increasing the value of a loss term’s weighting coefficient as training progresses.

Batch Normalization.

Normalizes the means and variances of each layer's inputs over each minibatch to make training the neural network faster and more stable.

Leaky Rectified Linear.

The output function will be the input if the input is positive, otherwise the output will be a constant times the input. The constant is usually less than 1.

Lipschitz Continuity.

The absolute value of a gradient is constrained to be no larger than a constant along any given direction.

Minibatches.

Splits of the training data into small subsets to calculate losses and update coefficients.

Neuronal Receptive Field.

The subset of inputs from a previous layer that feeds into a single "neuron" in a neural network, analogous to "neurons" in the human visual cortex V1 and V2.

Rectified Linear.

The output function will be the input if the input is positive, otherwise the output will be zero.

Residual Connections.

They allow gradients to flow through a network directly without passing through non-linear activation functions.

Spectral Normalization.

A weight normalization procedure that stabilizes the training of deep neural networks. Each layer's weight matrix is divided by its largest singular value (its spectral norm).

Squeeze-and-Excite.

A neural network layer that explicitly models the interdependencies across channels (e.g., RGB for the input layer, number of kernels for convolutional layer) from the previous layer by 2D pooling the layer and performing “self-attention” to learn dependencies.

Stochastic Gradient.

A stochastic approximation to the gradient used in gradient descent optimization. Replaces the gradient with an estimate based on a subset of the data.

Endnotes

1 Even without working at the pixel level, aesthetic design is high dimensional. Pfitzer and Rudolph (2007) describe 17 distinct elements (e.g., doors, roof, pillars, headlights) and 10 different design lines (e.g., roof line, cowl line, rail line). Each of these elements has multiple levels. All elements interact for visual and emotional appeal.

2 In our automotive application, the images contain 786,432 dimensions based on 512 × 512 × 3 (height × width × color) pixels. The dimensionality of the embedding space (512-dimensional) is a fine-tuning decision (Section 5.2).

3 Appendix A.13 provides an analysis where a model is trained using side views only. Augmenting training data with multiple viewpoints (±20 azimuth degrees) improves the predictive performance.

4 Because of the stratified sampling, the data split ratios (50%:25%:25%) are approximate.

5 Figure 5 uses eight open-access image equivalents of the proprietary image data provided by our partner firm.

6 For completeness, we used two pretrained deep learning baselines: one without attributes (Table 2) and another with attributes (Appendix A.10). The results are consistent; attributes do not improve predictive performance.

7 We get the same relative insights if the data are ranks and not ratings.

8 Andrew Norton is an executive director of Global Market Research, Volume Forecasting, and Competitive Intelligence at General Motors. Jeff Hartley is an adjunct associate professor of Integrative Systems and Design at University of Michigan. He worked as a technical director at General Motors for more than 30 years.

9 Because the dining room chair images are substantially fewer and with lower resolution than the automotive images, our second application serves to test how our models perform with smaller and lower quality data.

10 We selected “traditional versus modern” over “obtrusive versus prominent” and “typical versus unique” based on respondent consistency in the MTurk pretest. In open-ended interviews, consumers found these three dimensions to be relevant, easy to evaluate, well defined, and unambiguous. Details available from the authors.

References

  • Aaker DA, Keller KL (1990) Consumer evaluations of brand extensions. J. Marketing 54(1):27–41.CrossrefGoogle Scholar
  • Aubry M, Maturana D, Efros AA, Russell BC, Sivic J (2014) Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large data set of CAD models. Proc. IEEE Conf. on Comput. Vision and Pattern Recognition, (IEEE, Piscataway, NJ)3762–3769.Google Scholar
  • Berlyne DE (1971) Aesthetics and Psychobiology (Appleton-Century-Crofts, East Norwalk, CT).Google Scholar
  • Berthelot D, Schumm T, Metz L (2017) BEGAN: Boundary equilibrium generative adversarial networks. Preprint, submitted March 31, https://arxiv.org/abs/1703.10717.Google Scholar
  • Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112(518):859–877.CrossrefGoogle Scholar
  • Bloch PH (1995) Seeking the ideal form: Product design and consumer response. J. Marketing 59(3):16.CrossrefGoogle Scholar
  • Blonigen BA, Knittel CR, Soderbery A (2013) Keeping It Fresh: Strategic Product Redesigns and Welfare (National Bureau of Economic Research, Cambridge, MA).CrossrefGoogle Scholar
  • Bouchard C, Aoussat A, Duchamp R (2006) Role of sketching in conceptual design of car styling. J. Desert Res. 5(1):116.CrossrefGoogle Scholar
  • Chakraborty I, Kim M, Sudhir K (2022) Attribute sentiment scoring with online text reviews: Accounting for language structure and missing attributes. J. Marketing Res. 59(3):600–622.CrossrefGoogle Scholar
  • Chan TH, Mihm J, Sosa ME (2018) On styles in product design: An analysis of U.S. design patents. Management Sci. 64(3):1230–1249.LinkGoogle Scholar
  • Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016a) InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Adv. Neural Inform. Processing Systems, 29.Google Scholar
  • Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, et al. (2016b) Variational lossy autoencoder. Preprint, submitted November 8, https://arxiv.org/abs/1611.02731.Google Scholar
  • Cho H, Hasija S, Sosa M (2015) How Important Is Design for the Automobile Value Chain? (Social Science Research Network, Rochester, NY).CrossrefGoogle Scholar
  • Clement J (2007) Visual influence on in-store buying decisions: An eye-track experiment on the visual influence of packaging design. J. Marketing Management 23(9–10):917–928.CrossrefGoogle Scholar
  • Coates D (2003) Watches Tell More Than Time: Product Design, Information, and the Quest for Elegance (McGraw-Hill, London).Google Scholar
  • Cooper RG (1990) Stage-gate systems: A new tool for managing new products. Bus. Horizons 33(3):44–54.CrossrefGoogle Scholar
  • Creusen MEH, Schoormans JPL (2005) The different roles of product appearance in consumer choice. J. Production Innovative Management 22(1):63–81.CrossrefGoogle Scholar
  • Crilly N, Moultrie J, Clarkson PJ (2004) Seeing things: Consumer response to the visual domain in product design. Design Stud. 25(6):547–577.CrossrefGoogle Scholar
  • Danneels E, Kleinschmidtb EJ (2001) Product innovativeness from the firm’s perspective: Its dimensions and their relation with project selection and performance. J. Production Innovative Management 18(6):357–373.CrossrefGoogle Scholar
  • Daviet R (2018) Inference with Hamiltonian sequential Monte Carlo simulators. Preprint, submitted December 19, https://doi.org/10.48550/arXiv.1812.07978.Google Scholar
  • Dew R, Ansari A, Toubia O (2022) Letting logos speak: Leveraging multiview representation learning for data-driven logo design. Marketing Sci. 41(2):401–425.LinkGoogle Scholar
  • Dilokthanakul N, Mediano PAM, Garnelo M, Lee MCH, Salimbeni H, Arulkumaran K, Shanahan M (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. Preprint, submitted November 8, https://arxiv.org/abs/1611.02648.Google Scholar
  • Ding M, Hauser J, Dong S, Dzyabura D, Yang Z, Chenting S, Gaskin S (2011) Unstructured direct elicitation of decision rules. J. Marketing Res. 48(1):116–127.CrossrefGoogle Scholar
  • Dzyabura D, Hauser JR, El Kihal S, Ibragimov M (2018) Leveraging the power of images in predicting product return rates. Preprint, submitted July 27, https://dx.doi.org/10.2139/ssrn.3209307.Google Scholar
  • Feldman J, Zhang DJ, Liu X, Zhang N (2022) Customer choice models vs. machine learning: Finding optimal product displays on Alibaba. Oper. Res. 70(1):309–328.Google Scholar
  • Gabel S, Timoshenko A (2022) Product choice with large assortments: A scalable deep learning model. Management Sci. 68(3):1808–1827.LinkGoogle Scholar
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, et al. (2014) Generative adversarial nets. Adv. Neural Inform. Processing Systems, 2672–2680.Google Scholar
  • Griffin A, Hauser JR (1992) Patterns of communication among marketing, engineering, and manufacturing: A comparison between two new product teams. Management Sci. 38(3):360–373.LinkGoogle Scholar
  • Gross I (1972) The creative aspects of advertising. Sloan Management Rev. 14(1):83–109.Google Scholar
  • Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of Wasserstein GANs. Adv. Neural Inform. Processing Systems, 30.Google Scholar
  • Hartley J (1996) Brands Through the Lens of Style (Quest and Associates).Google Scholar
  • Hauser DJ, Schwarz N (2015) It’s a trap! Instructional manipulation checks prompt systematic thinking on “tricky” tasks. SAGE Open. 5(2):2158244015584617.CrossrefGoogle Scholar
  • Hekkert P, Snelders D, Wieringen PC (2003) ‘Most advanced, yet acceptable’: Typicality and novelty as joint predictors of aesthetic preference in industrial design. British J. Psych. 94(1):111–124.CrossrefGoogle Scholar
  • Heljakka A, Solin A, Kannala J (2018) Pioneer networks: Progressively growing generative autoencoder. Preprint, submitted July 9, https://arxiv.org/abs/1807.03026.Google Scholar
  • Heljakka A, Solin A, Kannala J (2019) Toward photographic image manipulation with balanced growing of generative autoencoders. Preprint, submitted April 12, https://arxiv.org/abs/1904.06145.Google Scholar
  • Hertenstein JH, Platt MB, Veryzer RW (2005) The impact of industrial design effectiveness on corporate financial performance. J. Production Innovative Management 22(1):3–21.CrossrefGoogle Scholar
  • Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, et al. (2017) beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations.Google Scholar
  • Higgins I, Amos D, Pfau D, Racaniere S, Matthey L, Rezende D, Lerchner A (2018) Towards a definition of disentangled representations. Preprint, submitted December 5, https://arxiv.org/abs/1812.02230.Google Scholar
  • Homburg C, Schwemmle M, Kuehnl C (2015) New product design: Concept, measurement, and consequences. J. Marketing 79(3):41–56.CrossrefGoogle Scholar
  • Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. Proc. IEEE/CVF Conf. on Comput. Vision and Pattern Recognition, (IEEE, Piscataway, NJ), 7132–7141.Google Scholar
  • Huang H, Li Z, He R, Sun Z, Tan T (2018) IntroVAE: Introspective variational autoencoders for photographic image synthesis. Adv. Neural Inform. Processing Systems 31(1):52–63.
  • Jang E, Gu S, Poole B (2016) Categorical reparameterization with Gumbel-Softmax. Preprint, submitted November 3, https://arxiv.org/abs/1611.01144.
  • Jindal RP, Sarangee KR, Echambadi R, Lee S (2016) Designed to succeed: Dimensions of product design and their impact on market share. J. Marketing 80(4):72–89.
  • Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Machine Learn. 37(2):183–233.
  • Kang N, Ren Y, Feinberg FM, Papalambros PY (2019) Form + function: Optimizing aesthetic product design via adaptive, geometrized preference elicitation. Working paper, University of Michigan, Ann Arbor.
  • Kang WC, Fang C, Wang Z, McAuley J (2017) Visually-aware fashion recommendation and design with generative image models. Proc. IEEE Internat. Conf. on Data Mining (IEEE, Piscataway, NJ), 207–216.
  • Karjalainen TM, Snelders D (2010) Designing visual recognition for the brand. J. Product Innovation Management 27(1):6–22.
  • Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of GANs for improved quality, stability, and variation. Preprint, submitted October 27, https://arxiv.org/abs/1710.10196.
  • Keller KL (2003) Brand synthesis: The multidimensionality of brand knowledge. J. Consumer Res. 29(4):595–600.
  • Keng (2017) Semi-supervised learning with variational autoencoders. Accessed July 17, 2019, https://bit.ly/2O9RvF8.
  • Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. Proc. 3rd Internat. Conf. on Learn. Representations.
  • Kingma DP, Welling M (2013) Auto-encoding variational Bayes. Preprint, submitted December 20, https://arxiv.org/abs/1312.6114.
  • Kingma DP, Mohamed S, Rezende DJ, Welling M (2014) Semi-supervised learning with deep generative models. Adv. Neural Inform. Processing Systems, 3581–3589.
  • Kreuzbauer R, Malter AJ (2005) Embodied cognition and new product design: Changing product form to influence brand categorization. J. Product Innovation Management 22(2):165–176.
  • Krippendorff K (2011) Computing Krippendorff’s alpha-reliability. Working paper, University of Pennsylvania, Philadelphia. https://repository.upenn.edu/asc_papers/43.
  • Kulkarni TD, Whitney WF, Kohli P, Tenenbaum J (2015) Deep convolutional inverse graphics network. Adv. Neural Inform. Processing Systems, 28.
  • Landwehr JR, Labroo AA, Herrmann A (2011) Gut liking for the ordinary: Incorporating design fluency improves automobile sales forecasts. Marketing Sci. 30(3):416–429.
  • Larsen ABL, Sønderby SK, Winther O (2015) Autoencoding beyond pixels using a learned similarity metric. Preprint, submitted December 31, https://arxiv.org/abs/1512.09300.
  • Liu L, Dzyabura D, Mizik N (2020) Visual listening in: Extracting brand image portrayed on social media. Marketing Sci. 39(4):669–686.
  • Liu X, Lee D, Srinivasan K (2019) Large scale cross-category analysis of consumer review content on sales conversion leveraging deep learning. J. Marketing Res. 56(6):918–943.
  • Liu Y, Li KJ, Chen H, Balachander S (2017) The effects of products’ aesthetic design on demand and marketing-mix effectiveness: The role of segment prototypicality and brand consistency. J. Marketing 81(1):83–102.
  • Lopez C, Miller S, Tucker C (2019) Exploring biases between human and machine generated designs. J. Mechanical Design 141(2):021104.
  • Makhzani A, Shlens J, Jaitly N, Goodfellow I, Frey B (2015) Adversarial autoencoders. Preprint, submitted November 18, https://arxiv.org/abs/1511.05644.
  • Martindale C (1990) The Clockwork Muse: The Predictability of Artistic Change (Basic Books, New York).
  • Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. Preprint, submitted February 16, https://arxiv.org/abs/1802.05957.
  • Morren M, Paas LJ (2020) Short and long instructional manipulation checks: What do they measure? Internat. J. Public Opinion Res. 32(4):790–800.
  • Nobari A, Chen W, Ahmed F (2021) PcDGAN: A continuous conditional diverse generative adversarial network for inverse design. Proc. 27th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (ACM, New York), 606–616.
  • Noble CH, Kumar M (2010) Exploring the appeal of product design: A grounded, value-based model of key design elements and relationships. J. Product Innovation Management 27(5):640–657.
  • Norman DA (2004) Emotional Design: Why We Love (or Hate) Everyday Things (Basic Books, New York).
  • Oppenheimer DM, Meyvis T, Davidenko N (2009) Instructional manipulation checks: Detecting satisficing to increase statistical power. J. Experiment. Soc. Psych. 45(4):867–872.
  • Orme B, Chrzan K (2017) Becoming an Expert in Conjoint Analysis: Choice Modelling for Pros (Sawtooth Software).
  • Orsborn S, Cagan J, Boatwright P (2009) Quantifying aesthetic form preference in a utility function. J. Mechanical Design 131(6):061001.
  • Orth UR, Malkewitz K (2008) Holistic package design and consumer brand impressions. J. Marketing 72(3):64–81.
  • Palazzolo M, Feinberg F (2015) Modeling consideration set substitution. Working paper, University of Michigan, Ann Arbor, MI.
  • Pan Y, Burnap A, Hartley J, Gonzalez R, Papalambros PY (2017) Deep design: Product aesthetics for heterogeneous markets. Proc. 23rd ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining (ACM, New York), 1961–1970.
  • Pauwels K, Silva-Risso J, Srinivasan S, Hanssens DM (2004) New products, sales promotions, and firm value: The case of the automobile industry. J. Marketing 68(4):142–156.
  • Person O, Snelders D, Schoormans J (2016) Assessing the performance of styling activities: An interview study with industry professionals in style-sensitive companies. Design Stud. 42:33–55.
  • Person O, Snelders D, Karjalainen TM, Schoormans J (2007) Complementing intuition: Insights on styling as a strategic tool. J. Marketing Management 23(9–10):901–916.
  • Pfitzer S, Rudolph S (2007) Re-engineering exterior design: Generation of cars by means of a formal graph-based engineering design language. Proc. Internat. Conf. on Engrg. Design (Design Society).
  • Ranganath R, Tran D, Blei DM (2015) Hierarchical variational models. Preprint, submitted November 7, https://arxiv.org/abs/1511.02386.
  • Ranscombe C, Hicks B, Mullineux G, Singh B (2012) Visually decomposing vehicle images: Exploring the influence of different aesthetic features on consumer perception of brand. Design Stud. 33(4):319–341.
  • Reid T, Gonzalez R, Papalambros PY (2010) Quantification of perceived environmental friendliness for vehicle silhouette design. J. Mechanical Design 132(10):101010.
  • Reppel AE, Szmigin I, Gruber T (2006) The iPod phenomenon: Identifying a market leader’s secrets through qualitative marketing research. J. Product Brand Management 15(4):239–249.
  • Rubera G (2015) Design innovativeness and product sales’ evolution. Marketing Sci. 34(1):98–115.
  • Sbai O, Elhoseiny M, Bordes A, LeCun Y, Couprie C (2019) DesIGN: Design inspiration from generative networks. Leal-Taixé L, Roth S, eds. Computer Vision Workshops (Springer International Publishing, Cham, Switzerland), 37–44.
  • Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. Preprint, submitted September 4, https://arxiv.org/abs/1409.1556.
  • Sohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. Adv. Neural Inform. Processing Systems.
  • Timoshenko A, Hauser JR (2019) Identifying customer needs from user-generated content. Marketing Sci. 38(1):1–20.
  • Toffoletto G (2013) The Strategic Value of Design: A Model Derived from the Existing Literature and Six Case Studies of Design Driven Organizations (Politecnico di Milano, Milan).
  • Toubia O, Netzer O (2017) Idea generation, creativity, and prototypicality. Marketing Sci. 36(1):1–20.
  • Ulyanov D, Vedaldi A, Lempitsky V (2018) It takes (only) two: Adversarial generator-encoder networks. Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI Press, Palo Alto, CA).
  • Vlasic B (2011) Once Upon a Car: The Fall and Resurrection of America’s Big Three Auto Makers: GM, Ford, and Chrysler (William Morrow, New York).
  • Wu J, Zhang C, Xue T, Freeman B, Tenenbaum J (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. Adv. Neural Inform. Processing Systems, 29.
  • Yoganarasimhan H (2017) Identifying the presence and cause of fashion cycles in data. J. Marketing Res. 54(1):5–26.
  • Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. Proc. IEEE Conf. on Comput. Vision and Pattern Recognition (IEEE, Piscataway, NJ), 586–595.
  • Zhang W, Yang Z, Jiang H, Nigam S, Yamakawa S, Furuhata T, Shimada K, et al. (2019) 3D shape synthesis for conceptual design and optimization using variational autoencoders. Preprint, submitted April 16, https://arxiv.org/abs/1904.07964.
  • Zhao J, Mathieu M, LeCun Y (2016) Energy-based generative adversarial network. Preprint, submitted September 11, https://arxiv.org/abs/1609.03126.