Digital signage and personal shopping assistant
APPLICATION
The personal shopping assistant application is the core of the Mall digital signage. It will exploit open-world re-identification and person detection and tracking in the Mall to detect the items of interest of each individual. This information, combined with the analysis of personal traits (both clothing and behaviors), will support a decision on the appropriate personalized contents to display when the person is in front of the digital signage terminal. The interactivity of the person at the terminal and the analysis of the emotional status will drive the display of contents that will eventually fit the person’s interests. The application should be capable of suggesting appropriate products in terms of item, style, color, and so on. This is a clear alternative to statically reproduced advertising, and we believe it is a valid tool in crowded shopping centers or retail shops where there is not enough personnel to serve all the customers appropriately.
The quality of suggestions is a crucial aspect of a recommendation system. Accuracy of suggestions alone (for example, recommending a clothing item that lies in the same class as the outfit the user is wearing) may not be enough to find the most relevant items for each user.
One of the goals of recommender systems is to provide a user with highly idiosyncratic or personalized items, and more diverse recommendations give users more opportunities to be recommended such items. With this motivation, we will study new suggestion methods that can increase the diversity of recommendation sets for a given individual user profile, in particular by developing algorithmic techniques that improve the individual and aggregate diversity of suggestions while still achieving high accuracy. Neighborhood-based and matrix factorization-based collaborative filtering will be considered.
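As an illustration of the accuracy/diversity trade-off described above, the following minimal sketch greedily re-ranks candidate items so that each new suggestion is penalized for being too similar to the ones already chosen. It is not the method to be developed in the project: the greedy re-ranking rule, the `trade_off` parameter, the item names and the feature vectors are all assumptions made for the example, and the predicted scores are assumed to come from any collaborative filtering model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_for_diversity(scores, features, k, trade_off=0.7):
    """Greedily pick k items, balancing predicted score against redundancy.

    scores:    item -> predicted relevance (hypothetical output of a
               collaborative filtering model)
    features:  item -> feature vector used to measure item similarity
    trade_off: 1.0 means pure accuracy, lower values favor diversity
    """
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        def value(item):
            # Redundancy is the highest similarity to any item already picked.
            redundancy = max(
                (cosine(features[item], features[s]) for s in selected),
                default=0.0,
            )
            return trade_off * scores[item] - (1 - trade_off) * redundancy

        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=value)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical catalogue: two near-identical red items and one different item.
scores = {"red_dress": 0.95, "red_skirt": 0.93, "blue_jeans": 0.80}
features = {
    "red_dress": [1.0, 0.0],
    "red_skirt": [1.0, 0.1],
    "blue_jeans": [0.0, 1.0],
}
print(rerank_for_diversity(scores, features, k=2, trade_off=0.5))
print(rerank_for_diversity(scores, features, k=2, trade_off=1.0))
```

With `trade_off=1.0` the two most accurate (and nearly identical) items are returned, while a lower value replaces the second one with a less similar item.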
We expect that the outcomes of open-world re-identification, person detection and tracking, and the extraction of personal traits and feelings will make digital signage contents both personalized and dynamic, that is, variable depending on the time at which the digital signage is observed during the stay in the Mall.
Personalization of the contents must be made in real time, within tight time limits (such limits can be determined from the distance of each individual to the digital signage terminal). This imposes clear performance constraints on the functional modules that will be defined in the early phase of the project. In addition, recognition of the customer’s emotional status and feelings must respond in strict real time to drive the terminal interactivity. Displaying the appropriate digital signage contents also requires the design of the content repository and its appropriate indexing.
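The time limit mentioned above can be sketched as simple arithmetic: the time a person needs to reach the terminal, minus the time needed to put the content on screen. The walking speed of 1.4 m/s and the one-second display lead time are illustrative assumptions, not project figures, and the function name is hypothetical.

```python
def personalization_budget_s(distance_m, walking_speed_mps=1.4, display_lead_s=1.0):
    """Seconds available to select content before the person reaches the terminal.

    distance_m:        estimated distance between the person and the terminal
    walking_speed_mps: assumed walking speed (illustrative default)
    display_lead_s:    assumed time needed to load and show the content
    """
    if distance_m < 0 or walking_speed_mps <= 0:
        raise ValueError("distance must be >= 0 and speed must be > 0")
    time_to_terminal = distance_m / walking_speed_mps
    # The budget cannot be negative: a person already at the terminal leaves none.
    return max(0.0, time_to_terminal - display_lead_s)


# A person detected 7 m away leaves roughly 4 s under these assumptions.
print(personalization_budget_s(7.0))
```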
We are not aware of similar digital signage infrastructures, nor of the use of Computer Vision and AI to perform online individual profiling in real contexts. We have agreed to test the whole application at a VERONAFIERE event, by their courtesy.
Open world re-identification
STATE of the ART
Traditional face recognition assumes a closed-set scenario, in which probe images contain only identities that were enrolled in the gallery. A more realistic scenario is the open set, where probes may contain subjects not enrolled in the gallery and the recognition system must detect and reject such probes. A more frequent scenario is the open world, where identities are learned incrementally and without supervision in the gallery as soon as they are observed. In this scenario, the approaches designed for the closed and open set are not suitable. Parametric learning methods like deep networks are natively designed to perform closed-set recognition and have been adapted to the open set only in a few cases. Recent research on Neural Turing Machines showed that deep networks may be enhanced by an external memory for quick integration of information about new items, ensuring that salient but statistically infrequent data are stabilized in the class representation. However, such a model does not support learning from video.
NOVELTY and METHODOLOGY
We plan to use deep networks to perform detections and extract representations of face observations, an external memory module to incrementally collect them, and a smart filtering mechanism that assigns observations to identities and decides whether they are relevant enough to be learned. We use the memory to break up the temporal correlation between consecutive instances, counteracting the non-i.i.d. nature of video data. In this way, identity clusters are built incrementally, putting together frequent and rare observations with no reference to their temporal occurrence. Since the individual outfit is used to learn personality, we exploit the outfit feature to improve the purity of the clusters. Continuity of appearance of face and outfit in consecutive frames will be used as a form of self-supervision to decide the instantiation of a new identity. Scalability of the method in large settings such as a Mall with a large number of individuals will be a key problem to address. Open-world recognition has been addressed so far by very few researchers. None of them provides satisfactory or well-validated results, and none has addressed open-world re-identification from video. Our approach will build on preliminary research results of the UNIFI team that provide good confidence in the feasibility of the method.
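The interplay between the external memory and the filtering mechanism can be sketched as follows. This is a deliberately simplified illustration, not the UNIFI method: the class name, the cosine-similarity matching rule and both thresholds are assumptions, and the embeddings are assumed to come from a face (or face plus outfit) descriptor network.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class IdentityMemory:
    """Hypothetical external memory that grows identity clusters incrementally."""

    def __init__(self, match_threshold=0.8, redundancy_threshold=0.98):
        self.match_threshold = match_threshold            # below this: new identity
        self.redundancy_threshold = redundancy_threshold  # above this: not stored
        self.clusters = {}                                # identity id -> embeddings
        self._next_id = 0

    def observe(self, embedding):
        """Assign an observation to an identity, creating one if none matches."""
        best_id, best_sim = None, -1.0
        for identity, stored in self.clusters.items():
            sim = max(cosine(embedding, e) for e in stored)
            if sim > best_sim:
                best_id, best_sim = identity, sim

        if best_id is None or best_sim < self.match_threshold:
            # Open-world case: nobody in memory matches, so enroll a new identity.
            best_id = self._next_id
            self._next_id += 1
            self.clusters[best_id] = [list(embedding)]
        elif best_sim < self.redundancy_threshold:
            # Filtering: keep only observations that add new information.
            self.clusters[best_id].append(list(embedding))
        return best_id


memory = IdentityMemory()
ids = [memory.observe(e) for e in ([1.0, 0.0], [0.9, 0.3], [0.0, 1.0], [1.0, 0.0])]
print(ids, {k: len(v) for k, v in memory.clusters.items()})
```

Because matching ignores the order in which observations arrived, a cluster mixes frequent and rare views of the same person regardless of when they were seen.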
Detection and tracking
STATE of the ART
Face and body detection and tracking are essential support for behavior analysis. Impressive advancements in tracking have been obtained by using deep networks to extract descriptors of the target appearance and retrain the model online. Appealing approaches have used two-stream Siamese networks and Recurrent Neural Networks to re-identify individuals on a frame-by-frame basis without learning any target-specific appearance model. However, most of these tracking approaches focus on single-object tracking with a single camera, and it is unclear to what extent deep learning techniques can be adopted in scenarios where group behaviors affect persons’ trajectories. Moreover, for tracking in large areas such as Malls, we believe that 360° video could be very useful. Very little work has been done on 360° video processing, and the adoption of 360° cameras in practical applications is still limited. Existing works discuss the application of this technology to virtual cinematography and sports video.
NOVELTY and METHODOLOGY
We will explore deep learning for tracking in a camera network context, where multiple individuals are moving either alone or in groups with occlusions, while interacting with objects or other individuals. This is an open research issue. We will study the use of 360° video in this context, following our preliminary research on the subject. Given a 360° video, we will instantiate several virtual PTZ cameras (vPTZs), one for each detected individual, and formulate tracking as the problem of controlling the vPTZs to focus on the individuals in the Mall. We will use deep learning to regress the parameters of the vPTZs. We will also investigate the possibility of directing attention toward significant and relevant situations. Instead of tracking all the individuals, the system might selectively track only a smaller subset that is estimated to be more relevant. The selection of which customers to track might be made by solving an optimization problem and/or by adopting reinforcement learning to reward the system each time it has focused its attention on customers who responded positively to digital signage suggestions.
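To make the notion of a virtual PTZ camera concrete, the sketch below maps a pixel of a vPTZ view to the corresponding position in a 360° frame. It assumes an equirectangular frame with longitude 0 at the horizontal center and the image y axis pointing down; the function name and parameters are illustrative, and the regression of pan, tilt and field of view by a deep network is outside the scope of this example.

```python
from math import asin, atan2, cos, pi, radians, sin, sqrt, tan


def vptz_pixel_to_equirect(u, v, pan, tilt, fov, view_size, frame_size):
    """Map pixel (u, v) of a virtual PTZ view to equirectangular coordinates.

    pan, tilt: viewing direction in radians (positive tilt looks up)
    fov:       horizontal field of view of the virtual camera, in radians
    view_size: (width, height) of the virtual view in pixels
    frame_size: (width, height) of the equirectangular 360-degree frame
    """
    view_w, view_h = view_size
    frame_w, frame_h = frame_size
    focal = (view_w / 2) / tan(fov / 2)

    # Ray through the pixel in camera coordinates (x right, y down, z forward).
    x, y, z = u - view_w / 2, v - view_h / 2, focal
    # Apply tilt (rotation about the x axis), then pan (rotation about the y axis).
    y, z = y * cos(tilt) - z * sin(tilt), y * sin(tilt) + z * cos(tilt)
    x, z = x * cos(pan) + z * sin(pan), -x * sin(pan) + z * cos(pan)

    norm = sqrt(x * x + y * y + z * z)
    lon = atan2(x, z)
    lat = asin(max(-1.0, min(1.0, -y / norm)))  # clamp against rounding errors

    px = ((lon / (2 * pi) + 0.5) * frame_w) % frame_w  # wrap around the seam
    py = (0.5 - lat / pi) * frame_h
    return px, py


# Center of a 640x480 view looking a quarter turn to the right, level with the horizon.
print(vptz_pixel_to_equirect(320, 240, pi / 2, 0.0, radians(60), (640, 480), (4096, 2048)))
```

Evaluating this mapping for every pixel of the virtual view gives the lookup table needed to render that view from the 360° frame, so steering a vPTZ only requires changing `pan`, `tilt` and `fov`.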
Extraction of personal temporary feelings
STATE of the ART
Understanding the feelings of an individual while looking at a specific advertising content, a shop window or a particular product on the shelf is very important for retailers in order to understand the degree of appreciation of a particular product. Most of the facial expression analysis solutions in the literature consider still face images acquired in controlled conditions by a traditional surveillance camera and detect the traditional seven classes of emotion: Joy, Sadness, Anger, Fear, Surprise, Contempt, and Disgust. Few of them have addressed emotion in faces extracted from videos in real environments, with overexposed or underexposed lighting conditions or blurring due to movement. There has also been little attention to the computational requirements of real-time operation, as needed in real contexts. In conditions like those of interest in our research, we should assume that the individuals are not cooperative and do not look at the camera, and that observations can be affected by blurring as well as by different lighting exposures.
NOVELTY and METHODOLOGY
We will consider people looking at digital signage both outside and inside the Mall, thus analyzing faces taken in a closed field. We will consider age, sex, and ethnicity as obtained from FM 3, and we will extract personal feelings by analyzing the spatio-temporal evolution of the face. We will use facial landmarks located at the eyes, eyebrows, nose, mouth and jawline, together with their temporal evolution, to recognize emotional trends and understand the feelings of an individual while looking at particular advertising contents, by training deep LSTM recurrent classifiers. We will design algorithms to detect and track the eyes of the person, estimate the degree of attention, and associate a particular emotional state with a particular display. The extracted temporary traits will be used both to drive the interactivity of the display and to provide retailers with statistics on the shop items.
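The input to such a recurrent classifier can be sketched as a sequence of per-frame feature vectors built from the landmarks. The example below is only an assumed preprocessing step: it normalizes each frame by the inter-ocular distance so that the features do not depend on the distance from the camera, and appends the displacement of every landmark with respect to the previous frame. The landmark ordering (eyes at indices 0 and 1) and the function names are hypothetical, and the LSTM itself is not shown.

```python
from math import hypot


def normalize_landmarks(landmarks, left_eye, right_eye):
    """Center landmarks between the eyes and scale by the inter-ocular distance."""
    (lx, ly), (rx, ry) = landmarks[left_eye], landmarks[right_eye]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    scale = hypot(rx - lx, ry - ly)
    if scale == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    return [((x - cx) / scale, (y - cy) / scale) for x, y in landmarks]


def landmark_sequence_features(frames, left_eye=0, right_eye=1):
    """Build one feature vector per frame: normalized positions plus motion.

    frames: list of frames, each a list of (x, y) landmark coordinates in pixels.
    Returns a list of flat vectors, one per frame, of length 4 * n_landmarks.
    """
    features = []
    previous = None
    for landmarks in frames:
        current = normalize_landmarks(landmarks, left_eye, right_eye)
        if previous is None:
            motion = [(0.0, 0.0)] * len(current)  # no motion for the first frame
        else:
            motion = [(x - px, y - py)
                      for (x, y), (px, py) in zip(current, previous)]
        features.append([c for point in current + motion for c in point])
        previous = current
    return features


# Two frames of a toy face (left eye, right eye, mouth) seen at different scales.
frames = [
    [(100, 100), (140, 100), (120, 130)],
    [(200, 200), (280, 200), (240, 250)],
]
for vector in landmark_sequence_features(frames):
    print(vector)
```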
Extraction of personal stable traits
STATE of the ART
Personal traits are stable social signals whose value doesn’t change suddenly, like personality and social status. Social signal processing (the marriage of pattern recognition and social psychology) has mainly focused on gestures, posture, gaze, physical appearance, and proxemics to extract social signals, while other features like clothing have only been acknowledged as having high potential in communicating social status and personality. In fact, Clothing Semiotics has confirmed that clothing is an evident blueprint of individuals, being completely dependent on their conscious choices, not as transient as a gesture, and more evident than micro-signals such as facial expressions. In a strict computer vision sense, the extraction of personal traits from clothing requires a fine-grained recognition of what a person is wearing and the identification of a fashion style. For the first problem, we should segment an outfit into its items. Such clothing parsing has recently received much attention, also due to the interest of key players like Amazon and Zalando. However, solutions are limited to simple settings such as a single person in the center of the image, with a neutral standing pose. No clothing dataset contains images with multiple people in crowded situations where occlusions are severe. For the second problem, the definition of style is blurry. However, prototypical style classes such as chic, sexy, and casual are commonly used in fashion and have been used in computer vision for classification tasks. Each outfit may have a score indicating how strongly it belongs to a given style, but very little research on style description has been done in computer vision. On the other hand, personal traits are also extracted from person analysis (age, sex, ethnicity) and behaviors. Group detection and analysis can be helpful to this end, highlighting body poses and focus of gaze that can be indicative of roles and familiarity between group members.
Moreover, the analysis of the nonverbal behavior of the individuals in a group makes it possible to highlight personal traits and social attitudes that cannot otherwise be identified by analyzing the persons alone.
NOVELTY and METHODOLOGY
Our goal is to investigate novel models that embed notions from social psychology studies on Clothing Semiotics and from behavior analysis research into computer vision, in order to perform profiling of people in a systematic way in the Mall scenario. This will be used to personalize the suggestions for either single individuals or groups (family with kids, couple, friends). We will address clothing parsing in crowded scenarios by adopting fully convolutional deep networks. Our research will focus on how to obtain plausible segmentations consistent with the body shape. Segmentation will be driven by pose estimation, which should be robust to crowded situations as well. In our previous research, we have verified the effectiveness of using bi-clustering to detect stable segments. We believe these can be taken as clothing signatures of the individual and can also reveal trends in the crowd, in the form of clothing items shared among people. So far, no other clustering or bi-clustering procedure has been applied to clothing parsing, and clothing parsing has never been applied to crowded situations. For each parsed individual, we will measure his/her similarity against a set of already classified prototype individuals. Each style class (chic, casual, etc.) is represented by different outfits, each of which has a score indicating how strongly it belongs to that style. Depending on the measured similarity, personal traits will be assigned in terms of personality and social status, according to clothing semiotics findings. We plan to verify a novel matching mechanism based on spatially local-modular metric learning. The distance between two outfit images will depend on the pixel similarity between the two images, but also on the type of clothing items that are matched together.
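The idea of a distance that depends on which clothing items are matched, followed by scoring against style prototypes, can be illustrated with the toy sketch below. It is not the proposed metric learning method: the per-type weights stand in for what a learned metric would provide, and the feature vectors, the fixed penalty for unmatched items and the scoring rule are all assumptions made for the example.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def outfit_distance(outfit_a, outfit_b, type_weights, missing_penalty=1.0):
    """Sum of per-item distances, weighted by the type of the matched items.

    outfit_a, outfit_b: item type -> appearance feature vector of that segment
    type_weights:       item type -> weight (hypothetical stand-in for a learned metric)
    missing_penalty:    cost of an item type worn in only one of the two outfits
    """
    total = 0.0
    for item_type in set(outfit_a) | set(outfit_b):
        weight = type_weights.get(item_type, 1.0)
        if item_type in outfit_a and item_type in outfit_b:
            total += weight * euclidean(outfit_a[item_type], outfit_b[item_type])
        else:
            total += weight * missing_penalty
    return total


def style_scores(outfit, prototypes, type_weights):
    """Score each style by its best prototype: prototype score / (1 + distance)."""
    scores = {}
    for style, prototype_score, prototype_outfit in prototypes:
        distance = outfit_distance(outfit, prototype_outfit, type_weights)
        candidate = prototype_score / (1.0 + distance)
        scores[style] = max(scores.get(style, 0.0), candidate)
    return scores


# Hypothetical prototypes: (style, how strongly the outfit belongs to it, outfit).
type_weights = {"top": 1.0, "shoes": 0.5}
prototypes = [
    ("casual", 0.9, {"top": [1.0, 0.0], "shoes": [0.0, 1.0]}),
    ("chic", 0.8, {"top": [0.0, 1.0], "shoes": [1.0, 0.0]}),
]
query = {"top": [1.0, 0.0], "shoes": [0.0, 1.0]}
result = style_scores(query, prototypes, type_weights)
print(max(result, key=result.get), result)
```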
We will adopt existing solutions for the estimation of age, sex, and ethnicity (these can be considered almost solved problems in computer vision). We will instead focus on extracting stable personal traits from behavior. We plan to study novel analysis methods that consider gestures, posture and gazing, of either the single person or the group, as well as paths and walking trajectories. Recent advances in person pose estimation allow people’s joint positions to be extracted with great accuracy, making it possible to overcome the problems of occlusions and high-density scenarios. We will use this information to analyze activities, recurrent nonverbal behavioral patterns of persons, and interactions with objects, in order to extract individual social signals and interpersonal attitudes.