COMPUTER VISION APIS
Learn more about how different computer vision technologies automate elaborate manual processes and how they are used in our products, such as APIs that generate digital fingerprints of images and videos.
Computer vision is a broad interdisciplinary field concerned with extracting high-level information from digital images and image sequences such as videos. The information of interest varies, ranging from the purely technical to the emulation of human visual perception in a more general sense.
ENGINEERING- VS. MACHINE LEARNING-BASED APPROACHES
Higher-level computer vision can be divided into engineering-based and machine learning-based approaches. The distinction between the two is blurred, but engineering-based approaches are usually designed as a processing chain with some or all of the following steps:
1. Image pre-processing (e.g. low-pass and high-pass filtering)
2. Segmentation (e.g. thresholding and the application of color models)
3. Feature detection (e.g. edges, corners, and blobs)
4. Measurement or pattern matching
5. Result post-processing (e.g. outlier filters, consistency checks, and refinement)
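The five steps can be sketched on a tiny grayscale image. This is a minimal illustration under stated assumptions, not the chain used in any particular product: all function names are hypothetical, the image is a plain list of rows, and Step 3 is reduced to collecting foreground pixels as a stand-in for a real feature detector.

```python
# Toy processing chain on a grayscale image stored as a list of rows.
# All names are hypothetical and chosen for illustration only.

def box_blur(img):
    """Step 1: low-pass filter (3x3 mean; borders use the available neighbours)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def threshold(img, t):
    """Step 2: segmentation with a fixed threshold."""
    return [[1 if v > t else 0 for v in row] for row in img]

def foreground_points(mask):
    """Step 3 (stand-in for feature detection): coordinates of foreground pixels."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]

def centroid(points):
    """Step 4: a simple measurement, the centre of the detected region."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def inside_image(c, width, height):
    """Step 5: consistency check on the result."""
    return 0 <= c[0] < width and 0 <= c[1] < height

def run_chain(img, t):
    points = foreground_points(threshold(box_blur(img), t))
    if not points:
        return None
    c = centroid(points)
    return c if inside_image(c, len(img[0]), len(img)) else None

image = [
    [0, 0, 0, 0, 0],
    [0, 200, 200, 200, 0],
    [0, 200, 200, 200, 0],
    [0, 200, 200, 200, 0],
    [0, 0, 0, 0, 0],
]
print(run_chain(image, 110))  # centre of the bright square: (2.0, 2.0)
```

In a real system each step would be far more elaborate, and some steps may be skipped entirely, as noted above.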
SEARCHING VISUAL CONTENT WITH AN ENGINEERING-BASED APPROACH
Most applications for computer vision technology require measurements and/or so-called pattern matching. The extraction and matching of local features for the purpose of pattern matching became popular toward the end of the last millennium, one of the groundbreaking methods being the Scale-Invariant Feature Transform (SIFT).
PATTERN MATCHING BASED ON LOCAL FEATURES
Within the processing chain outlined above, such local features are extracted in Steps 1 and 3 and matched in Steps 4 and 5; Step 2 is not usually applied for local feature extraction and matching. The information generated in Steps 1–3 is referred to as an image’s digital fingerprint. The location and other geometric information of such a feature, such as its rotation, translation, and scale, are commonly referred to as a "key point" and are determined by a so-called key point detector. The visual image content that belongs to a key point is usually represented as a feature vector, which is commonly referred to as a descriptor. The RANSAC algorithm and other consistency checks are usually applied in Step 5.
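The role of RANSAC in Step 5 can be illustrated with a deliberately simplified sketch. It assumes that the two images differ only by a translation, so a single match is a minimal sample; real systems typically estimate richer models such as a similarity transform or a homography. The function name, the tolerance, and the sample data are hypothetical.

```python
import random

def ransac_translation(matches, tol=2.0, iterations=100, seed=0):
    """Find the translation supported by the most key point matches.

    matches: list of ((x1, y1), (x2, y2)) pairs of matched key point locations.
    Returns (best_translation, inlier_matches).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    best_t, best_inliers = None, []
    for _ in range(iterations):
        # Draw a minimal random sample: one match defines one translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Count the matches that agree with this hypothesis.
        inliers = [m for m in matches
                   if abs((m[1][0] - m[0][0]) - dx) <= tol
                   and abs((m[1][1] - m[0][1]) - dy) <= tol]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = (dx, dy), inliers
    return best_t, best_inliers

# Six consistent matches (shift of roughly (10, 5)) and two wrong ones.
matches = [
    ((0, 0), (10, 5)), ((5, 2), (16, 7)), ((9, 9), (19, 15)),
    ((3, 7), (13, 12)), ((8, 1), (19, 7)), ((2, 4), (12, 9)),
    ((1, 1), (-29, 41)), ((6, 6), (61, -6)),
]
translation, inliers = ransac_translation(matches)
print(translation, len(inliers))
```

The two inconsistent matches never gather more than one supporter each, so they are rejected as outliers while the six consistent matches form the consensus set.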
Pattern matching based on such local features is usually understood as a computer vision (vs. machine vision) approach. There are various kinds of detectors and descriptors that differ in terms of accuracy, size, detection speed, matching speed, and the type of structures that they can handle. Many binary descriptors that allow for fast matching have been proposed recently.
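Binary descriptors match quickly because two descriptors can be compared with an XOR followed by a bit count (the Hamming distance). The following is a minimal brute-force sketch; the 16-bit descriptors, the function names, and the distance threshold are assumptions made for illustration, and real descriptors are usually several hundred bits long.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")

def match_descriptors(query, database, max_distance=3):
    """Brute-force nearest-neighbour matching.

    Returns a list of (query_index, database_index, distance) for every
    query descriptor whose closest database descriptor is near enough.
    """
    matches = []
    for qi, q in enumerate(query):
        di, d = min(enumerate(database), key=lambda item: hamming(q, item[1]))
        dist = hamming(q, d)
        if dist <= max_distance:
            matches.append((qi, di, dist))
    return matches

query = [0b1010101010101010, 0b1111000011110000]
database = [0b1111000011110001, 0b0000000000000000, 0b1010101010101000]
print(match_descriptors(query, database))  # [(0, 2, 1), (1, 0, 1)]
```

For large databases, brute-force search is replaced by index structures, but the distance computation stays the same.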
VISUAL CONTENT IDENTIFICATION USING LOCAL AND GLOBAL FEATURES
Local features are mostly used for object recognition and pose estimation as well as image retrieval. The benefits of such approaches are that they do not require image segmentation (Step 2), which is often problematic in real-world applications, and they can naturally handle partial occlusions. However, their applicability is restricted to images that offer the kind of features that can be detected by the key point detector. In practice, most photographs and videos certainly do offer these; however, things such as logos usually do not.
IMPLEMENTATION IN THE TECXIPIO COMPUTER VISION APIS
The features we have developed for the Reverse Search APIs are fast to compute, compact in size, and allow for high-speed matching with the aid of highly optimized data structures, balancing efficiency against quality. Matching on the basis of local features can be regarded as the gold standard for general-purpose image and video retrieval, but it is rarely used because of its computational complexity. Our optimized implementations allow us to benefit from this technology at a reasonable cost.
Approaches based on global features are used for applications that do not require all the benefits of local feature matching but for which speed is of the utmost importance. “Global” here means that an image is not represented by a set of local features but by a single feature vector. This allows for a considerable speedup in matching at the cost of reduced robustness to partial occlusion. Simple global features are only applicable for finding more or less exact duplicates (except for scale), while sophisticated types of global features are computed on the basis of a set of local features, thus preserving most of the properties of local features and even tolerating partial occlusions to some degree.
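A simple global feature can be sketched with an average hash, a common textbook technique that is used here purely as an illustration and is not claimed to be the feature behind any specific API. The sketch assumes the image has already been scaled down to a small fixed grid, which is what makes such a feature insensitive to scale; the function names are hypothetical.

```python
def average_hash(img):
    """One bit per pixel: 1 if the pixel is brighter than the image mean.

    img: small grayscale image as a list of rows (already downscaled).
    Returns the bits packed into a single integer, row by row.
    """
    pixels = [v for row in img for v in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for v in pixels:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits

def bit_difference(a, b):
    """Hamming distance between two hashes."""
    return bin(a ^ b).count("1")

original = [[10, 10, 200, 200] for _ in range(4)]
brighter = [[v + 20 for v in row] for row in original]       # near duplicate
different = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]       # other content

print(bit_difference(average_hash(original), average_hash(brighter)))   # 0
print(bit_difference(average_hash(original), average_hash(different)))  # 8
```

The whole image collapses into one short bit string, so matching is a single comparison. Covering part of the image, however, changes many bits at once, which is the reduced robustness to occlusion described above.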
APPLICATIONS OF COMPUTER VISION APIS
Computer vision technology based on pattern matching with local features, as described above, is highly effective for identifying images or videos in large databases. It enables large-scale searches for visual files when IDs, metadata, or further descriptive information is missing, incomplete, insufficient, or unreliable. Image and video identification processes still involve a high degree of manual work, especially when they rely solely on descriptive text tags. Digital fingerprinting and matching algorithms significantly accelerate these processes, which saves valuable time and resources while reducing human error.
Therefore, computer vision technology is increasingly implemented in areas such as media identification (e.g. for anti-piracy measures) and monitoring (e.g. for ad tracking), spam/upload filters, quality control, and the management of large media archives.
VISUAL CONTENT RECOGNITION WITH COMPUTER VISION APPROACHES
Face detection and face recognition are other popular computer vision disciplines. The task of a face detector is to find any (unknown) face in an image, whereas a face recognizer identifies known faces, i.e. known persons, which is a classification task. It is common for a face recognizer to operate on the candidates detected by a face detector. The recognizer must assign different images of the same person to that person despite varying lighting conditions, pose, and looks.
A better example of an actual classification task is recognizing a certain type of animal, such as dogs. The classifier must then not only handle the different appearances of one particular animal but also have a notion of dogs in general, so that it recognizes a dog of any breed as a dog. Another common example besides animals is furniture such as chairs and tables.
TRAINING CLASSIFIERS WITH TRAINING DATA
All these tasks have in common that the classifier must grasp what is common to that class, but just as importantly what is not. Using the example of a dog, engineering-based methods try to tackle this in a bottom-up manner, i.e. extract suitable low-level features, maybe try to detect legs, head, and tail, and then check for geometric consistency. In contrast, machine learning approaches learn their representation from a large number of images of dogs (positive examples) and a large number of non-dog images (negative examples). The choice of the training data is crucial for the resultant classifier. For instance, if a system was trained with negative examples that were only landscape images, i.e. it has never seen any animal other than a dog, it will probably “assume” that a cat is also a dog. This is natural; it would also be the case for a human learner.
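The effect of the negative examples can be reproduced with a toy experiment. The sketch below uses a nearest-neighbour classifier as a stand-in for a learned model, and the three features (furriness, amount of green, snout length) as well as all numbers are invented for illustration.

```python
import math

def nearest_neighbour_label(sample, training_set):
    """Return the label of the training example closest to the sample.

    training_set: list of (feature_vector, label) pairs.
    """
    return min(training_set, key=lambda ex: math.dist(sample, ex[0]))[1]

# Hypothetical features: (furriness, amount of green, snout length)
dogs = [((0.9, 0.1, 0.8), "dog"), ((0.8, 0.2, 0.9), "dog"), ((0.9, 0.1, 0.7), "dog")]
landscapes = [((0.0, 0.9, 0.0), "not dog"), ((0.1, 0.8, 0.1), "not dog")]
cats = [((0.8, 0.1, 0.1), "not dog"), ((0.9, 0.2, 0.3), "not dog")]

unseen_cat = (0.9, 0.1, 0.2)

# Negatives are landscapes only: the cat looks far more like a dog than a field.
print(nearest_neighbour_label(unseen_cat, dogs + landscapes))         # dog

# Negatives also contain cats: the classifier can now tell them apart.
print(nearest_neighbour_label(unseen_cat, dogs + landscapes + cats))  # not dog
```

The classifier itself is unchanged between the two runs; only the training data differs.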
VISUAL PERCEPTION TASKS WITH DEEP CONVOLUTIONAL NEURAL NETS
Over the last few years, deep convolutional neural nets (CNNs) have become very popular; learning approaches of this kind are referred to as deep learning. Thanks to the convolutional topology and the computational power available today, visual perception tasks can be solved with considerably higher recognition performance than with conventional artificial neural nets. This technology allows for the creation of powerful recognition systems and, more recently, localization systems, which can be used, for instance, to automatically tag images and videos.
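The basic operation behind the convolutional topology is small: the same kernel is slid over the whole image, so one set of weights responds to a pattern wherever it occurs. The sketch below is a plain-Python illustration with a hand-picked kernel; in a real CNN the kernel values are learned during training and many such layers are stacked. The function names are hypothetical.

```python
def conv2d_valid(img, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    As in most CNN frameworks, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(img[y + j][x + i] * kernel[j][i]
                 for j in range(kh) for i in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]

def relu(feature_map):
    """Non-linearity that keeps positive responses only."""
    return [[max(0, v) for v in row] for row in feature_map]

image = [[0, 0, 1, 1] for _ in range(3)]   # dark left half, bright right half
edge_kernel = [[-1, 1]]                    # responds to dark-to-bright steps

print(relu(conv2d_valid(image, edge_kernel)))
# [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The response is non-zero only in the column where the brightness changes, regardless of the row, which illustrates the weight sharing that keeps convolutional layers compact.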
MACHINE LEARNING APPROACHES TO IMAGE SUPER-RESOLUTION
CNNs are not restricted to recognition and localization tasks; they can also be used for image enhancement and super-resolution, among other things. In contrast to engineering-based methods, which are mainly based on interpolation, machine learning approaches to image super-resolution can “guess” structures within a low-resolution image and insert them into the super-resolution output. This yields considerably sharper, photo-realistic results, whereas interpolation techniques produce blurry ones.
PECULIARITIES OF CNNs
While CNNs can solve many problems that were previously considered unsolvable, it is important to mention that they remain a black box to a certain degree, even though these kinds of neural nets and their training algorithms are mathematically well described and fully understood. This black box is trained with a large amount of training data and a set of training parameters that are tuned by experience but also, to some degree, by trial and error. As a consequence, even very impressive deep learning systems with extremely high true positive rates and extremely low false positive rates can still produce baffling results in rare cases, e.g. recognizing a dog in an image of pure noise with high confidence, which is something that a human observer would never do.
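Why interpolation blurs can be seen on a single row of pixels. The sketch below upscales a sharp edge with linear interpolation; the function name is hypothetical, and the example covers only the interpolation side of the comparison, not a learned super-resolution model.

```python
def upscale_linear(signal, factor):
    """Upscale a 1-D signal (at least two samples) by linear interpolation."""
    n = len(signal)
    out = []
    for k in range((n - 1) * factor + 1):
        pos = k / factor              # position in the original signal
        i = min(int(pos), n - 2)      # left neighbour (clamped at the end)
        t = pos - i                   # weight of the right neighbour
        out.append((1 - t) * signal[i] + t * signal[i + 1])
    return out

edge = [0, 0, 100, 100]               # a perfectly sharp edge
print(upscale_linear(edge, 2))
# [0.0, 0.0, 0.0, 50.0, 100.0, 100.0, 100.0]
```

The upscaled edge passes through an intermediate value of 50 that was never in the original, so the transition becomes softer. Interpolation can only average existing pixels, whereas a learned model can place a sharp edge where its training data suggests one belongs.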