Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
How Deepnote uses Deepnote At Deepnote, we think of the notebook as a universal computational medium. It’s easy to get started with, but allows for great composability. As a result, the notebook can serve not just as a tool for data exploration, but also as an elegant building block for a company’s entire data platform…In this post, we’ll explore how we at Deepnote use Deepnote to power our entire data infrastructure, which we like to think is organized as a pyramid structure…
Pareto and Pandas This post muses about what it means to learn a software library. I’ll use Pandas as an example, but the post isn’t just about Pandas. Suppose you say “I want to learn Pandas.” That implicitly assumes Pandas is one thing, and in a sense it is. In another sense Pandas is hundreds of things. At the top level, the pandas module (version 1.2.0) has 142 things inside…
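For readers who want to see what a count like "142 things" looks like in practice, here is a minimal sketch that lists the top-level names of any importable module. The helper name `top_level_names` is hypothetical, and the excerpt does not say how its figure was counted, so excluding underscore-prefixed names is an assumption. The demo uses the standard-library `math` module so it runs without pandas installed.

```python
import importlib


def top_level_names(module_name, include_private=False):
    """Return the sorted names exposed at the top level of a module.

    Hypothetical helper for illustration; the counting rule used in the
    quoted post is not stated in the excerpt.
    """
    module = importlib.import_module(module_name)
    names = dir(module)
    if not include_private:
        # Assumption: names starting with an underscore are not counted.
        names = [name for name in names if not name.startswith("_")]
    return sorted(names)


if __name__ == "__main__":
    names = top_level_names("math")
    print(len(names), "public top-level names in math")
```

Passing `"pandas"` instead of `"math"` gives the equivalent count for whichever pandas version is installed, which will likely differ from the 1.2.0 figure quoted above.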
Favourite sets of (unpublished) lecture notes which can be found freely online? (Twitter / X) what are your favourite sets of (unpublished) lecture notes which can be found freely online? i am happy to hear about both introductory and more advanced topics. i am personally thinking { maths, stats, cs, physics } stuff, but honestly just share whatever!…
There are seven core components of an A/B testing stack, but if they’re not all working properly, it can mean your company isn’t making the right decisions. Meaning teams aren’t shipping features that are actually helping customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present. Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Four ways to streamline your R workflows Finding ways to reduce manual tasks when programming, like copying and pasting files or code, can save you time and minimise the risk of errors. This blog post guides you through a few small changes to your R workflow to help reduce manual tasks and streamline your programming workflows in R…
DocLLM: A layout-aware generative language model for multimodal document understanding Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure…
Why don't we have more interesting activation functions? [Reddit] There's not too much evidence that biological neural networks have unusual activation functions (say mod n), but with so many connections which may be wired differently to how we do activation functions and attention, who can know? I do not think extremely strong negative inhibitive weights play this role; it's different to have an all or nothing mod function that may not be learnable up a negative weight gradient..
Machine Learning for Smart and Energy-Efficient Buildings Energy consumption in buildings, both residential and commercial, accounts for approximately 40% of all energy usage in the United States, and similar numbers are being reported from countries around the world…In this work, we review some of the most promising ways in which ML has been leveraged to make buildings smart and energy-efficient. For the convenience of readers, we provide a brief introduction to the relevant ML paradigms and the components and functioning of each smart building system we cover. Finally, we discuss the challenges faced while implementing machine learning algorithms in smart buildings and provide future avenues for research in this field…
TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown a great success in prediction of clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new pretraining objective—predicting all diseases and outcomes of a patient at a future visit from previous visits…
List of R packages for animal tracking data I've created a list of R packages for animal tracking data that will help us develop movement analysis tools for managers. I am seeking additional suggestions (esp re: robustness to data gaps)…
Stuff we figured out about AI in 2023 2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s. Here’s my attempt to round up the highlights in one place!…
A tile map showing a month of streamflow conditions across the U.S. A tile map showing a month of streamflow conditions across the U.S. - how to and GitHub repository…
Why does nothing ever get used? [Reddit] Dashboards, views, tables, pipelines, entire data marts. Why does 90% of the work I do never get used?…I used to be one of the best BA's in my entire company so I am very good at requirements gathering and understanding what the business is trying to accomplish…Six months ago I just.... stopped doing QA.. I have been relying on the "scream test", I mark tickets resolved and immediately move to prod and only do QA if someone screams that something is wrong. I have yet to hear back on anything…
Inferring the number of floors for residential buildings Data on the number of floors is required for several applications, for instance, energy demand estimation, population estimation, and flood response plans. Despite this, open data on the number of floors is rare, even when a 3D city model is available. In practice, it is most often inferred with a geometric method: elevation data is used to estimate the height of a building, which is divided by an assumed story height and rounded. However, as we demonstrate in this paper with a large dataset of residential buildings, this method is unreliable: <70% of the buildings have a correct estimate…We propose several indicators (e.g., construction year, cadastral attributes, building geometry, and neighborhood census data), and we present a predictive model trained with 172,000 buildings in the Netherlands. Our model achieves an accuracy of 94.5% for residential buildings with five floors or less, which is an improvement of about 25% over the geometric approach…
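The geometric baseline described in the abstract (building height divided by an assumed story height, then rounded) can be sketched as follows. This illustrates only that baseline, not the paper's predictive model. The function name, the 3.0 m default story height, the half-up rounding rule, and the minimum of one floor are all assumptions made for the example.

```python
import math


def estimate_floors(building_height_m, storey_height_m=3.0):
    """Estimate a floor count as height / assumed story height, rounded.

    Hypothetical helper: the 3.0 m default and the one-floor minimum are
    illustrative assumptions, not values taken from the paper.
    """
    if building_height_m <= 0 or storey_height_m <= 0:
        raise ValueError("heights must be positive")
    ratio = building_height_m / storey_height_m
    # Round half up explicitly; Python's built-in round() rounds halves
    # to the nearest even integer instead.
    floors = math.floor(ratio + 0.5)
    return max(1, floors)


if __name__ == "__main__":
    for height in (2.9, 9.2, 14.0):
        print(height, "m ->", estimate_floors(height), "floor(s)")
```

Because the result depends entirely on the assumed story height, buildings with tall ground floors, attics, or pitched roofs are easily misestimated, which is consistent with the unreliability the authors report for this method.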
It's 2024 and they just want to learn The state of the ML communities big and small starting 2024. My general expectations for the year…
JaxMARL: Multi-Agent RL Environments in JAX In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. When considering wall clock time, our experiments show that per-run our JAX-based training pipeline is up to 12500x faster than existing approaches. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field…
We have removed the “Jobs” section of the newsletter as we don’t think it adds much value to you. Hit reply and let us know what section you want us to start including :)
A recap of exciting stories from the 20-30 subreddits in related spaces?
Highlights of interesting AI tools from the 20-30 newsletters in related spaces?
TikTok / Instagram / Etc. Posts from Data/AI Influencers
leave this section empty - the email is already long enough as it is :)
others?
Spectral, a new marketplace for verifiable machine intelligence that leverages zkML to ensure accuracy, verification, and IP protection for modelers, has launched its first-ever model-building challenge for data scientists. The challenge aims to help address societal issues by using open source to produce high-performing ML models. The models built in this challenge will have massive implications for the crypto industry as we know it. A $100k bounty is on the line, as well as an 85% revenue share for modelers on the models they build. Engineers can sign up now, and can expect more challenges in early 2024.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Building A Graph Convolutional Network for Molecular Property Prediction In this article, we will explore the basics of one particular ML model — a graph convolutional network — through the lens of chemistry. This is not meant to be a mathematically rigorous exploration; instead, we will try to compare features of the network with traditional models in the natural sciences and think about why it works as well as it does…
Marking telomeres on a simple ideogram in R I was recently running the telomere identifier Tapestry on some genome assemblies to assess how close they were to chromosome-level. Tapestry already outputs an ideogram of assembly scaffolds marked with telomeres, but I wanted to customise the plot to make it clearer when showing collaborators. There are a couple of R packages that can produce ideograms, including karyoploteR and ggbio, both of which I’ve dabbled in. But in this case I really wanted something even simpler than what either of those packages offer, and instead made a very basic telomere-marked ideogram using just ggplot2…
Can LLMs Replace Data Analysts? Getting Answers Using SQL In the previous article, we’ve started building an LLM-powered analyst. We decided to focus on descriptive analytics and reporting tasks since they are the most common for analysts…The next step would be to teach our LLM-powered analyst to get any metrics. Analysts usually use SQL to get data. So, the most helpful skill for the LLM analyst would be interacting with SQL databases. We’ve already discussed OpenAI functions and learned how LLMs can use tools to integrate with the world. In this article, I would like to focus on LLM agents and discuss them in more detail. We will learn how to build agents using LangChain and try different agent types…
* Based on unique clicks. ** Find last week's issue #527 here.
Looking to get a job? Check out our “Get A Data Science Job” Course A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~60,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) All our best, Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2024 DataScienceWeekly.org, All rights reserved.