June 04, 2025
By Hubert Brychczynski
Artificial Intelligence,
Software Engineering,
Generative AI,
Machine Learning,
Data Engineering,
DevOps
This is Part 2 of our engineering experiment: testing AI in machine learning, data engineering, and DevOps.
For a deeper dive into how AI impacts front end and back end development, read Part 1.
Anthropic CEO Dario Amodei recently wrote: “When a generative AI system does something, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake.”
If even the CEO of an AI company says we don’t fully understand why generative AI does what it does, how are engineering teams supposed to make smart decisions about integrating it with their workflows?
At Janea Systems, we turned to experimentation. After working with AI-assisted coding for some time, we posed the inevitable question: how much is it actually worth?
We recruited four senior- to expert-level engineers for each of the following five domains: front end, back end, machine learning, data engineering, and DevOps. The twenty participants completed specific coding tasks with and without AI, then submitted quantitative and qualitative feedback on their performance and experience.
This is the second and final installment in our series discussing the experiment’s results.
Here’s a list of tasks that each domain expert tackled:
We suspect that performance gains across all five domains were influenced by engineers’ domain knowledge, tool familiarity, and prompt engineering proficiency. That impact, however, was particularly notable in the three domains discussed in this article.
This may also explain why the results in these three domains - especially DevOps - appear underwhelming compared to the two domains covered in Part 1.
Table 1 presents average self-reported assessments of expertise, tool proficiency, and prompt engineering familiarity across all domains.
Table 1: Engineer self-assessment
Figure 1 illustrates the domain-by-domain speedup. Machine learning and data engineering saw a 24% and 10% uptick, respectively. However, DevOps engineering experienced a decline, with AI slowing progress by 5%.
Fig. 1: Task performance improvement across domains
Figure 2 shows the proportion of AI-generated solutions that worked out of the box. Machine learning solutions worked in 87.5% of cases, followed by data engineering and DevOps at approximately 75% and 50%, respectively.
Fig. 2: Percentage of AI-generated solutions working out of the box
Even when AI-generated solutions in these three domains worked out of the box, none of them were perfect. In every case, engineers needed to spend additional time refining or adjusting the output to make it viable. The following figures show how much effort went into this fine-tuning process.
Figure 3 reflects how much time engineers spent refining AI-generated solutions, where “1” indicates extensive time spent and “5” indicates minimal time.
Fig. 3: Time spent improving AI-generated solutions
Figure 4 shows how many changes engineers made to AI-generated solutions, where “1” indicates few changes and “5” indicates many.
Fig. 4: Number of changes made to AI-generated solutions
The use of AI in machine learning offered familiar advantages, but also revealed distinct limitations. Use cases included initial code structure generation, neural network architecture scaffolding, and auto-suggestions. Here too, AI was best suited for generating boilerplate and supporting quick prototyping.
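To give a sense of what that boilerplate looks like, here is a minimal sketch of the kind of model scaffold an assistant typically produces on request. It is purely illustrative: the class name, layer sizes, and random smoke-test data are our own placeholders, not code from the experiment.

```python
import torch
import torch.nn as nn

class TabularClassifier(nn.Module):
    """Small feed-forward network of the sort AI assistants scaffold on request."""

    def __init__(self, in_features: int, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Quick smoke test on random data to confirm the shapes line up.
model = TabularClassifier(in_features=10)
logits = model(torch.randn(8, 10))
print(logits.shape)  # torch.Size([8, 2])
```

Scaffolds like this are cheap to generate and easy to review, which is exactly the territory where our participants found AI most useful.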
Machine learning was also the first domain where AI faltered frequently enough that continuous, critical human oversight became essential. AI-generated code, statistics, and data were often plainly wrong, illogical, broken, outdated, or inconsistent. Moreover, models struggled when handling large datasets and complex data scenarios.
Data engineers saw a modest 10% improvement when solving tasks with AI. This result, however, warrants context. Participants’ self-reported proficiency in the technologies used was intermediate, averaging just 2.875, even though their general domain expertise was high (4.0). Additionally, only one in four participants had studied prompt engineering. Both factors likely contributed to the more limited gains observed.
That said, AI did provide "technically correct" starter templates and accelerated solution validation. Even when AI suggestions fell short, engineers were able to leverage them to finish tasks faster. Nonetheless, outdated, incorrect, or incomplete AI-generated responses frequently prompted additional debugging, refinement, and iterative prompting.
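A typical starter template of this kind might look like the bare-bones extract-transform-load script below. The shape, file paths, and column names are our own illustration, not output from the experiment.

```python
import pandas as pd

# Hypothetical source and destination; replace with real locations.
SOURCE_CSV = "raw/events.csv"
TARGET_PARQUET = "curated/events.parquet"

def run_pipeline() -> None:
    # Extract: read the raw feed, parsing the timestamp column up front.
    df = pd.read_csv(SOURCE_CSV, parse_dates=["event_time"])

    # Transform: deduplicate, normalize column names, drop rows without a timestamp.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower() for c in df.columns]
    df = df[df["event_time"].notna()]

    # Load: write the curated output as Parquet (requires pyarrow or fastparquet).
    df.to_parquet(TARGET_PARQUET, index=False)

if __name__ == "__main__":
    run_pipeline()
```

A template like this runs, which is presumably what “technically correct” meant in practice; the heavier work of schema handling, incremental loads, and error recovery still falls to the engineer.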
DevOps engineers reported the lowest domain expertise (3.25) and only moderate tool proficiency (3.5), with just one in four having studied prompt engineering. These factors likely impacted completion times, as participants needed to iterate prompts and spend additional effort verifying AI outputs.
Still, the engineers praised AI’s ability to generate infrastructure code snippets, standardize scripts through references to documentation, suggest best practices, and accelerate drafting of deployment pipelines for routine CI/CD steps.
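As an illustration of what such an infrastructure snippet might look like, here is a minimal sketch using Pulumi’s Azure Native provider to declare a resource group and a storage account. We are assuming Pulumi purely for the example; the resource names are placeholders and the code is not taken from the experiment.

```python
import pulumi
from pulumi_azure_native import resources, storage

# Placeholder names; a real project would parameterize these per environment.
resource_group = resources.ResourceGroup("demo-rg")

account = storage.StorageAccount(
    "demosa",
    resource_group_name=resource_group.name,
    sku=storage.SkuArgs(name=storage.SkuName.STANDARD_LRS),
    kind=storage.Kind.STORAGE_V2,
)

# Surface the generated account name as a stack output.
pulumi.export("storage_account_name", account.name)
```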
Those advantages, however, were consistently offset by extensive manual correction whenever AI responses proved overly generic or off-target. Certain recommendations also caused Azure configuration mismatches, requiring extra troubleshooting. Taken together, these issues compounded, and engineers often spent more time fixing AI-generated code than they would have spent writing it from scratch.
Machine learning, data engineering, and DevOps did not see anywhere near as dramatic an improvement from using AI as front end and back end. Machine learning and data engineering accelerated by 24% and 10%, respectively, while DevOps actually slowed down by 5%.
These gains are not insignificant, but they pale in comparison with the 66.94% improvement in front end and the 55.93% in back end.
The truth is that the more an engineer knows about their domain, the tools of the trade, and prompt engineering, the more effectively they use AI for coding. Investing in engineer education is the way forward.
What’s 24% today could be 50% tomorrow - and 80% the day after. We believe there’s always room to improve, and we have experience to back it up:
We re-engineered Microsoft Bing’s deep learning pipelines, making TensorFlow 50x faster and accelerating training by 7x.
We designed and implemented a future-proof data architecture with Delta Lake and SCD Type 2 tracking, enabling large-scale predictive modeling and AI analytics.
We enabled PyTorch support on ARM64 architecture, facilitating AI development on new Windows machines and AI applications at the edge.
We don’t just play with AI - we make it better.
Ready to discuss your software engineering needs with our team of experts?