
Launch HN: Aquarium (YC S20) – Improve Your ML Dataset Quality https://ift.tt/3ew54Xm

Hi everyone! I’m Peter from Aquarium ( https://ift.tt/3dwAufn ). We help deep learning developers find problems in their datasets and models, then help fix them by smartly curating their datasets. We want to build the same high-powered data curation tooling that sophisticated ML companies like Cruise, Waymo, and Tesla have, and bring it to the masses.

ML models are defined by a combination of code and the data that the code trains on. A programmer must think hard about what behavior they want from their model, assemble a dataset of labeled examples of that behavior, and then train the model on that dataset. As they encounter errors in production, they must collect and label data that covers those errors, retrain, and verify the fixes by monitoring the model’s performance on a test set that includes the previous failure cases. See Andrej Karpathy’s Software 2.0 article ( https://ift.tt/2hsOCzx ) for a great description of this workflow.

My cofounder Quinn and I were early engineers at Cruise Automation (YC W14), where we built the perception stack and ML infrastructure for self-driving cars. Quinn was tech lead of the ML infrastructure team and I was tech lead for the perception team. We frequently ran into problems with our dataset that we needed to fix, and we found that most model improvement came from improving the dataset’s variety and quality. Basically, ML models are only as good as the datasets they’re trained on.

ML datasets need variety so the model can train on the types of data it will see in production. In one case, a safety driver noticed that our car was not detecting green construction cones. Why? When we looked into our dataset, it turned out that almost all of the cones we had labeled were orange. Our model had seen few examples of green cones at training time, so it performed quite badly on them in production. We found and labeled more green cones for our training dataset, retrained the model, and it detected green cones just fine.

ML datasets also need clean and consistent labels so the model does not learn the wrong behavior. In another case, we retrained our model on a new batch of data from our labelers and it performed much worse at detecting “slow signs” in our test dataset. After days of careful investigation, we realized that a change to our labeling process had caused our labelers to label many “speed limit signs” as “slow signs,” which confused the model and hurt its performance on “slow signs.” We fixed our labeling process, did an additional QA pass over the dataset to correct the bad labels, retrained the model on the clean data, and the problem went away.

While there’s a lot of tooling out there to debug and improve code, there’s not a lot of tooling to debug and improve datasets. As a result, it’s extremely painful to identify issues with variety and quality, and to modify datasets appropriately to fix them. ML engineers often run into scenarios like:

- Your model’s accuracy on the test set is 80%. You abstractly understand that the model is failing on the remaining 20%, and you have no idea why.
- Your model does great on your test set but performs disastrously when you deploy it to production, and you have no idea why.
- You retrain your model on some new data that came in, it gets worse, and you have no idea why.
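To make the green-cone story above concrete, here is a minimal sketch of the kind of label-distribution audit that surfaces that sort of variety gap. The schema (labels as dicts with "class" and "color" fields) is hypothetical and only for illustration:

```python
# Minimal sketch of a label-distribution audit (hypothetical schema: each
# label is a dict with a "class" and a "color" attribute). A heavy skew
# toward one attribute value is the kind of variety gap described above.
from collections import Counter

def audit_attribute(labels, attribute):
    """Count how often each value of `attribute` appears in the labels."""
    counts = Counter(label[attribute] for label in labels if attribute in label)
    total = sum(counts.values())
    for value, count in counts.most_common():
        print(f"{value:>12}: {count:6d}  ({100.0 * count / total:5.1f}%)")
    return counts

# Example: audit cone colors in a labeled dataset.
labels = [
    {"class": "cone", "color": "orange"},
    {"class": "cone", "color": "orange"},
    {"class": "cone", "color": "green"},
]
audit_attribute([l for l in labels if l["class"] == "cone"], "color")
```

A skew like 99% orange vs. 1% green is easy to miss when you are only eyeballing individual examples.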
ML teams want to understand what’s in their datasets, find problems in their datasets and model performance, and then edit or sample data to fix those problems. Most teams end up building their own one-off tooling in-house, and it usually isn’t very good. It typically relies on naive, highly manual data curation: “eyeballing” many examples in the dataset to discover labeling errors and failure patterns. This works well for small datasets but starts to fail once a dataset grows beyond a few thousand examples.

Aquarium’s technology lets your trained ML model do the work of guiding which parts of the dataset to pay attention to. Users get started by submitting their labels and corresponding model predictions through our API. Aquarium then lets users drill into their model performance - for example, visualize all examples where the model confused a labeled car for a pedestrian within a given date range - so users can understand the different failure modes of a model. Aquarium also finds the examples where your model has the highest loss / disagreement with your labeled dataset, which tends to surface many labeling errors (i.e., the model is right and the label is wrong!) - a minimal version of this is sketched below.

Users can also provide their model’s embeddings for each entry, an anonymized representation of what their model “thought” about the data. The neural network embedding for a datapoint (generated either by our users’ neural networks or by our stable of pretrained nets) encodes the input data into a relatively short vector of floats. By analyzing the distances between these embeddings, we can identify outliers and group together similar examples in a dataset (also sketched below). We also provide a nice thousand-foot-view visualization of the embeddings that lets users zoom into interesting parts of their dataset ( https://youtu.be/DHABgXXe-Fs?t=139 ). Since embeddings can be extracted from most neural networks, the platform is very general: we have successfully analyzed datasets and models operating on images, 3D point clouds from depth sensors, and audio.

After finding problems, Aquarium helps users solve them by editing or adding data. When it finds bad data, Aquarium integrates with our users’ labeling platforms to automatically correct labeling errors. When it finds patterns of model failures, Aquarium samples similar examples from users’ unlabeled datasets (green cones!) and sends them to labeling. Think of this as a platform for interactive learning. By focusing on the most “important” areas of the dataset - the ones the model is consistently getting wrong - we increase the leverage of ML teams to sift through massive datasets and decide on the proper corrective action to improve model performance.

Our goal is to build tools that reduce or eliminate the need for ML engineers to handhold the process of improving model performance through data curation - basically, Andrej Karpathy’s Operation Vacation concept ( https://youtu.be/g2R2T631x7k?t=820 ) as a service.

If any of those experiences speak to you, we’d love to hear your thoughts and feedback. We’ll be here to answer any questions you might have!
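The high-loss / disagreement ranking mentioned above is simple to sketch for a classification task. This is not Aquarium’s API or implementation - just a minimal numpy version, assuming you already have per-example predicted class probabilities and integer labels:

```python
# Minimal sketch of surfacing likely label errors by ranking examples by the
# model's loss against the current labels (assumes classification with
# per-example predicted probabilities; not Aquarium's actual code).
import numpy as np

def rank_by_disagreement(probs, labels, top_k=50):
    """Return indices of examples where the model most disagrees with the label.

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,)   array of integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Cross-entropy of the labeled class under the model's prediction.
    losses = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    # Highest-loss examples first: these often turn out to be labeling errors
    # (the model is right and the label is wrong).
    return np.argsort(-losses)[:top_k]

# Example usage (model_probs and dataset_labels are placeholders):
# suspect_idx = rank_by_disagreement(model_probs, dataset_labels)
```

The highest-ranked examples are the ones worth sending back for a label QA pass, since many of them turn out to be mislabeled rather than genuinely hard.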
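And here is a minimal sketch of the embedding-based pieces: flagging outliers by distance to their nearest neighbors, and mining similar examples from an unlabeled pool (the green-cone workflow) to send to labeling. It uses scikit-learn’s NearestNeighbors; the shapes, parameters, and variable names are assumptions for illustration, not Aquarium’s implementation:

```python
# Minimal sketch of the embedding-based workflow described above: flag outliers
# by mean distance to their nearest neighbors, and mine unlabeled examples that
# look like a known failure case. Brute-force scikit-learn version.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_outliers(embeddings, k=10, top_n=100):
    """Rank examples by mean distance to their k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)   # dists[:, 0] is the self-distance (0)
    scores = dists[:, 1:].mean(axis=1)     # larger = more isolated = more outlier-like
    return np.argsort(-scores)[:top_n]

def mine_similar(query_embeddings, unlabeled_embeddings, per_query=20):
    """For each failure example, pull its nearest unlabeled neighbors to label."""
    nn = NearestNeighbors(n_neighbors=per_query).fit(unlabeled_embeddings)
    _, idxs = nn.kneighbors(query_embeddings)
    return np.unique(idxs)                 # indices into the unlabeled pool

# Example usage (labeled_embeddings, failure_idx, unlabeled_embeddings are placeholders):
# outlier_idx  = find_outliers(labeled_embeddings)
# to_label_idx = mine_similar(labeled_embeddings[failure_idx], unlabeled_embeddings)
```

Brute-force nearest neighbors is fine for illustration; at large dataset sizes you would typically switch to an approximate nearest-neighbor index.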
