📝 Guest post: How to Select Better ConvNet Architectures for Image Classification Tasks*
Was this email forwarded to you? Sign up here There's no go-to formula for any ML project. A huge part of improving machine learning systems is experimenting until you find a solution that works well for your use case. However, an overwhelming amount of tools and model architectures are available for computer vision tasks. In this article, Superb AI’s team provides you with a strategy to select a suitable architecture for your image classification project. Image Classification Task OverviewImage Classification is a task that attempts to comprehend an entire image as a whole. The goal is to classify the image by assigning it to a specific label. Image classification can be considered a fundamental problem because it forms the basis for other computer vision tasks like object localization, detection, segmentation, etc. The ImageNet Challenge is the (unofficial) benchmark standard for most computer vision algorithms when it comes to image classification. It is an annual competition that uses subsets from a large dataset of annotated photographs comprising over 1 million images across 1,000 object classes, AKA ImageNet. Historically, machine learning on images had been a challenging task until 2012, when an extensive deep convolutional neural network called AlexNet showed excellent performance on the ImageNet Challenge and caught everyone’s attention. Ever since, convolutional neural networks, or ConvNets, have evolved very fast and surpassed human-level performance on visual tasks. Fast forward to 2022, and Vision Transformer or ViT models are currently topping the charts. Transformer architecture, initially proposed for natural language processing, adopts the self-attention mechanism - having the model focus on some part of the input text sequence when predicting each word. The Vision Transformer model adapts the attention idea to work on images, where the equivalent of words are patches that the input image is broken into. The concept of applying Transformer architecture to images is very interesting and potentially the source of breakthroughs. At the time of writing this blog, ViT models do scale well with datasets but are still problematic in small-scale datasets. We don’t understand them fully yet and are just starting to put ViT models into practice. ConvNets, on the other hand, were put into practice for more than a decade and have become very effective and popular due to the democratization of data, computational power, and abundance of tools and frameworks that followed. And since the goal of this article is to help practitioners to choose the model architecture for real-world computer vision problems, supervised image classification using ConvNets is what we will focus on. Selecting a ConvNet ArchitectureThere are plenty of ConvNet architectures for image classification that you can employ for real-world computer vision projects. However, choosing which model architecture to use is not always an exact science. But we should always start any machine learning project by setting up goals and expectations. Determine Desired PerformanceHow can one determine a reasonable level of performance to expect for your specific use case? Model accuracy and inference speed both matter to ML engineers: • Accuracy - We’d like to use a model that gives us the highest accuracy, which means a better experience for end-users. • Speed of model’s training and inference - We’d want the model to train and generate predictions as fast as possible because, in production, we might have to serve thousands of users and need to retrain the model quickly if any problems occur. Generally, we need to use a deeper (larger) model to get higher accuracy. But with a larger model comes a larger number of parameters that make it slower to execute. To visualize the accuracy/complexity tradeoff among common SOTA architectures in computer vision, you could refer to the Benchmark Analysis of Representative Deep Neural Network Architectures paper (published in 2018) that features a very useful chart. For most deep learning deployments in the industry, we should consider the accuracy level that is necessary for an application to be safe, computationally affordable, and appealing to users. For example, suppose you know that the end goal of your project is to deploy the model on edge devices. In that case, complexity is a more critical concern, and you might have to sacrifice accuracy a bit (MobilNet architecture is a good choice here). On the other hand, if training cost and prediction latency are not a concern for your project, and you are after the highest accuracy, you might want to choose the largest/fanciest model or even consider an ensemble of several model architectures. Once you have an idea of your realistic desired performance, your architecture search decisions will be guided by this goal. In the next section, let’s zoom in on some of the most commonly used SOTA ConvNet architectures. Popular ConvNet ArchitecturesThis section will briefly look at various ConvNet architectures widely used nowadays, i.e., their evolution, advantages, and disadvantages. VGG FamilyTo improve upon AlexNet, researchers started to increase the depth or number of convolutional layers. Adding more layers resulted in better classification accuracy. VGG architecture (proposed in 2014) had a deeper network structure and improved accuracy but also doubled its size and increased runtimes compared to AlexNet. It has 16 convolutional layers (and 19 in the later iteration) and uses 3x3 convolutions exclusively without losing accuracy. Why choose VGG architecture? Its simple architecture makes it relatively easy to understand and implement. It performs well on common image classification scenarios. VGG architectures have also been found to learn hierarchical elements of images like texture and content, making them popular choices for training style transfer models. InceptionThe Inception network came to the world in 2014. The main hallmark of this architecture is building a convolutional neural network while improving the utilization of computing resources. When lining up the convolutional and pooling layers in a network, the designer has multiple choices, and the best decision is not obvious. Instead of deciding beforehand which sequence of layers is the most effective, the Inception module provides several alternatives that the network can choose from based on data and training. Why choose Inception architecture? It’s a deeper CNN than VGG, with fewer parameters and significantly higher accuracy. This results in a shorter training time and smaller size of the trained model. On the other hand, this architecture can be pretty complex to implement from scratch. ResNetResNet or Residual Neural Network, introduced in 2015, continued the trend of increasing the depth of the models. One downside of adding too many layers is that doing so makes the network more prone to overfitting the training data due to the vanishing gradient problem. To solve this, ResNet introduced a novel residual module architecture with skip connections that allow information to flow from earlier layers in the network to later layers, creating an alternative shortcut path for the gradient to flow through. This fundamental breakthrough allowed for the training of extremely deep neural networks with hundreds of layers. The most common variant of ResNet is ResNet50, containing 50 layers, but larger variants can have over 100 layers. Why choose ResNet architecture? ResNet is widely used as it performs really well on classification tasks specifically. On top of that, it uses global average pooling, which accelerates the training (compared to the previous architectures like VGG). If you want something tried and true, this is a good choice. XceptionThe Xception model was proposed by Francois Chollet (creator of Keras) in 2017. This model is an extension of Inception architecture that replaces the inception model with depthwise separable convolutions and ResNet-style skip connections. Why choose Xception architecture? Xception offers a combination of both ResNet and Inception architectural features in a simpler design. This makes architecture easy to define and modify. Linear stack layers make training faster than Inception, and it slightly outperforms Inception in terms of accuracy on ImageNet. MobileNetThe MobileNet architecture was developed by Google to find a neural network suitable for edge devices such as smartphones or tablets. MobileNet architectures introduced two important innovations: depthwise separable convolutions and a width multiplier (hyperparameter). Depthwise, separable convolutions replaced regular convolution layers with fewer parameters and are more computationally efficient. The width multiplier is a parameter that controls how many parameters are used for each convolution layer. This allows for the creation of several networks along a tradeoff curve of size and speed versus accuracy. Why choose MobileNet architecture? An array of models can be created with this architecture so that more powerful devices can receive larger, more accurate models. In contrast, less powerful devices can use smaller, less accurate models. If you need a model that you can deploy on edge devices with scarce computational resources, MobileNet is the way to go. EfficientNetA later iteration of MobileNetV2 by the same team is called EfficientNet. Inverted residual bottlenecks are again the basic building blocks, but the optimization goal was tweaked towards prediction accuracy rather than mobile inference latency. Convolutional architectures have three main ways of scaling: use more layers, use more channels in each layer or use higher-resolution input images. The EfficientNet paper points out that these scaling axes are not independent, and this family of networks is scaled along three scaling axes rather than just one. Why choose EfficientNet architecture? This architecture is a workhorse of many applied ML teams because it offers optimal performance levels for every weight count. If you don’t have size/speed restrictions (inference on auto-scaling cloud system) and want the best/fanciest model, consider EfficientNet. Neural Architecture SearchFor much of the last decade, new state-of-the-art results were achieved with new network architecture manually created by researchers. As with many tasks that rely on human intuition and experimentation, someone eventually asked if a machine could do it better. Neural architecture search (NAS) automates the process of neural network design. Given a goal (e.g., model accuracy) and constraints (network size or inference latency), these methods rearrange composable blocks of layers to form new architectures. Though NAS has found new architectures that outperform their human-designed peers, the process is incredibly computationally expensive. Why choose NAS architecture? Besides being computationally expensive, the network architectures discovered by NAS techniques typically don’t look like those designed by humans. We don’t know why they work well, so we cannot move some ideas to other architectures. Nevertheless, NAS achieves high accuracy, and, in some cases, that’s enough.
Want to discuss your particular use case? Reach out to our team, and we can put you on the right track! *This post was written by Dasha Gurova, community manager at Superb AI, and originally posted here. We thank Superb AI for their ongoing support of TheSequence.You’re on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities. |
Older messages
📃🪄🖼 Edge#204: Inside Imagen. Google’s Impressive Text-to-Image Alternative to OpenAI’s DALLE-2
Thursday, June 30, 2022
Imagen provides a simpler architecture able to generate photorealistic images from language inputs
🎂🎉🦾 Happy Birthday TheSequence!
Wednesday, June 29, 2022
50% OFF to celebrate our 2nd anniversary!
🔴🟨 Edge#203: What are Graph Recurrent Neural Networks?
Tuesday, June 28, 2022
+ what GNNs on Dynamic Graphs; and the exploration of DeepMind's Jraph, a GNN Library for JAX.
📝 Guest post: Right-Sizing Training Workloads with NVIDIA A100 and A40 GPUs*
Monday, June 27, 2022
In this article, CoreWeave's team explains how it is helping companies deploy more timely and efficient AI applications by right-sizing projects to optimize training on both NVIDIA A100 and A40 GPU
📹 🤖 Transformers for Video
Sunday, June 26, 2022
Weekly news digest curated by the industry insiders
You Might Also Like
What Investors Want From AI Startups in 2025
Monday, November 25, 2024
Top Tech Content sent at Noon! How the world collects web data Read this email in your browser How are you, @newsletterest1? 🪐 What's happening in tech today, November 25, 2024? The HackerNoon
GCP Newsletter #426
Monday, November 25, 2024
Welcome to issue #426 November 25th, 2024 News LLM Official Blog Vertex AI Announcing Mistral AI's Large-Instruct-2411 on Vertex AI - Google Cloud has announced the availability of Mistral AI's
⏳ 36 Hours Left: Help Get "The Art of Data" Across the Finish Line 🏁
Monday, November 25, 2024
Visual Capitalist plans to unveal its secrets behind data storytelling, but only if the book hits its minimum funding goal. View Online | Subscribe | Download Our App We Need Your Help Only 36 Hours
DeveloPassion's Newsletter #180 - Black Friday Week
Monday, November 25, 2024
Edition 180 of my newsletter, discussing Knowledge Management, Knowledge Work, Zen Productivity, Personal Organization, and more! Sébastien Dubois DeveloPassion's Newsletter DeveloPassion's
Meet HackerNoon's Latest Features: Boost Stories with Translations, Speech-to-Text & More
Monday, November 25, 2024
Hey, Hacker! HackerNoon's monthly product update is here! Get ready for a new version of the mobile app, more translation developments, a new AI Gallery, backend moves, and more! 🚀 This product
The ultimate holiday gadget gift
Monday, November 25, 2024
AI isn't hitting a wall; $70 off Apple Watch; 60+ Amazon deals -- ZDNET ZDNET Tech Today - US November 25, 2024 Meta Quest 3S Why the Meta Quest 3S is the ultimate 2024 holiday present This $299
Deduplication in Distributed Systems: Myths, Realities, and Practical Solutions
Monday, November 25, 2024
This week, we'll discuss the deduplication strategies. We'll see whether they're useful and consider scenarios where you may need them. We'll also do a reality check with the promises
How to know if your data has been exposed
Monday, November 25, 2024
How do you know if your personal data has been leaked? Imagine getting an instant notification if your SSN, credit card, or password has been exposed on the dark web — so you can take action
⚙️ Amazon and Anthropic
Monday, November 25, 2024
Plus: The hidden market of body-centric data
⚡ THN Recap: Top Cybersecurity Threats, Tools & Tips (Nov 18-24)
Monday, November 25, 2024
Don't miss the vital updates you need to stay secure. Read the full recap now. The Hacker News THN Recap: Top Cybersecurity Threats, Tools, and Practices (Nov 18 - Nov 24) We hear terms like “state