3 tricks for efficient data science

It’s amazing how little things can turn a data science / mathematical modelling project into a full-fledged mess. The traps are easily avoidable but, unfortunately, also easily forgotten: we love to pretend we “couldn’t possibly do something that silly”.

Following the principles below has saved me (and others) a lot of time and spared me unnecessary pain:

Always check your data

No one has ever been arrested for excess logging. If you don’t check the inputs to your models and the outputs of intermediate calculations, you are in for disappointing results.
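
As an illustration, here is a minimal sketch of such checks in Python, assuming a pandas workflow (the “check_model_inputs” helper and the specific checks are mine, purely illustrative):

    import logging

    import pandas as pd

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def check_model_inputs(df: pd.DataFrame) -> pd.DataFrame:
        """Log basic sanity checks before the data reaches the model."""
        logger.info("shape: %s", df.shape)
        logger.info("missing values per column:\n%s", df.isna().sum())
        logger.info("summary statistics:\n%s", df.describe())
        # Fail loudly on conditions no downstream step can recover from.
        if df.empty:
            raise ValueError("input dataframe is empty")
        return df

The same idea applies between steps: logging shapes and summary statistics after each transformation is cheap, and it is usually the fastest way to find where things went wrong.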

There is a promise of a greener future in which deep learning will release us from all the mess of pre-processing: we will ingest raw data in any format and the intermediate layers will sort it out. Sorry, but here in real life we are still far from that. Garbage in, garbage out.

Test your model on limit cases

This is somewhat related to the previous point, though not identical. What happens to your model in the limit? Many scientists ask this kind of question as a basic sanity check, and you should too: does your model predict what the domain expert / common sense expects when you make one variable very large or very small? If not, something is wrong. Do you have a positive regression coefficient for monthly totals, but a negative one for average totals? Time to revisit your model.
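
For instance, a quick limit-case check could look like the sketch below (the toy data, the variable meanings and the “revenue grows with spend” expectation are all invented for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: revenue driven by monthly spend and number of stores.
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 100, size=(200, 2))  # columns: [monthly_spend, num_stores]
    y = 3.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 10, size=200)

    model = LinearRegression().fit(X, y)

    # Push one variable to an extreme while holding the other at its mean.
    baseline = X.mean(axis=0)
    low, high = baseline.copy(), baseline.copy()
    low[0], high[0] = 0.0, 10_000.0  # far beyond the training range
    pred_low, pred_high = model.predict(np.vstack([low, high]))

    # The domain expert says revenue must grow with spend.
    assert pred_high > pred_low, "prediction does not grow with spend; revisit the model"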

A common source of errors here is scale: two variables on wildly different scales often produce odd, counter-intuitive models.
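
One common remedy, sketched here with scikit-learn (one option among several; my choice of estimator is arbitrary), is to put all features on a comparable scale before fitting:

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Standardizing first keeps a variable measured in millions from
    # dominating one measured in fractions purely because of its units.
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

Wrapping the scaler in the pipeline means it is re-fitted on every training run, so it cannot silently drift out of sync with the model.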

Fail fast

A common error, especially in highly technical teams, is to over-engineer prematurely. There is a justified fear behind this: if the data science experiments become too messy, it will be too complicated for the engineers to implement them when it’s time to scale. But this very rarely justifies discarding a working Python prototype to rewrite everything from scratch in C.

Another common error is to chase the “tool of the day” and build the solution on a poorly supported technology. There are thousands of poorly maintained, poorly documented open-source projects.

Don’t reinvent the wheel. Bet on tested technologies to iterate quickly and, ultimately, bring value to your end user, whether that is a consulting client or someone else within your organization. Once the project is under way and you have earned the end user’s trust, there will be plenty of time to explore alternatives.

 

4 reasons to invest in open-source data science training

The ever-changing technology landscape brings lots of challenges for companies, especially around data science, where we are witnessing a constant expansion of technologies and tools. Every company is becoming increasingly data-driven, and the best way to remain relevant over the next five years is to be prepared.

What concrete advantages does open-source data science bring to your organization?

1. Develop your own use cases.

As the dust settles and clear winners emerge (R and Python), organizations should get ready to jump into data science by identifying relevant use cases for their businesses. The focus is less and less on technologies and is finally moving to bottom-line value. No one is better prepared than your own trusted, experienced employees to come up with the right use cases for data science in your business. You do not need an expensive consultancy; you already have the experts!

2. In-house customer support.

Although “open-source software” is understood as free, as in “free beer”, it comes at a cost. Yes, you don’t have to pay a hefty yearly license fee, but you do have to adapt the software to your infrastructure and needs. A large chunk of open-source vendors’ revenue comes from ongoing support. The way out? Train your own technology experts.

3. Increased productivity and reduced employee turnover.

A happy employee is an employee who stays with you. Open-source technologies are here to stay, and smart, valuable data scientists and data analysts want to work with them. Proprietary software limits their creativity, and that will eventually make them leave. It is not pleasant to be locked in to the limited tooling that vendors offer, no matter how “easy” they make it sound.

4. Trained employees can be promoted.

Although there are signs that data science is starting to be commoditized, with data science and machine learning bootcamps and even master’s and PhD degrees everywhere, you cannot really expect such “instant experts” to take on a senior role. Investing in your employees now will save you money later, when you are fully committed to embracing advanced analytics in your business, as new joiners will find strong mentors and leaders within your team.

 

Interested in upgrading your in-house analytics and open-source software capabilities? Reach out!