Hey everyone, welcome to the Coding Intelligence Machine Learning and AI Group Lab. I'm Shashank, I'm part of the evangelism team here at Apple, and I'm joined today by a fantastic panel of experts from across the machine learning framework teams. We'll start off with a quick round of introductions, but I'd like to make it a little more interesting. There have been a lot of exciting machine learning and AI announcements at WWDC, agency, agents in Xcode, new CoreAI framework, new evaluations framework, foundation models framework updates, MLX distributed, and so much more. So each of you, as you introduce yourself, please share one thing that you are most excited and you think developers would be most excited about from your area. Yeah. KEVIN VLK: I'm Kevin, and I work on Xcode. One of the things that I'm most excited about is that I get to actually build software that I couldn't really build before, whether it's something that was maybe I just didn't have time for, or maybe it was something new, like a new technology, like one of your technologies that I get to adopt in an app and try out. Agents let me do it efficiently and learn new things. Cool. Eric? We're just going down the line? All right. Hi, everybody. My name is Eric. I work on the Foundation Models Framework. And I would say, man, we've got a lot coming out this year. But the thing that I'm most excited about, personally, is the language model protocol that allows you to plug in all kinds of different inference backends to the framework. So you can call out to MLX, CoreAI. We've got a Google package that just went live. Anthropic released theirs this morning. There's so many options now. And developers like you guys are going to be able to add your own integrations as well. So it's going to open up tons of possibilities. I'm really excited about that. Exciting. Hi, everyone. I'm Steven. I'm an engineer on the evaluations framework team. 
I'm personally really excited about the entire concept of evaluations and bringing that to you all. One thing in particular is the Model Judge Evaluator, and just how easy we've made it for you to configure it to rate the quality of the responses that you're getting from your LLM-based features. Cool. Hi, everyone. My name is Raciel. I work on Core AI. Selfishly, I will say that the most exciting feature is the launch of Core AI. I got into building AI frameworks because I wanted to bring my research to production. And I built a few. And I think this is the best one so far. And also because it powers a lot of our features, like Eric was mentioning, the Foundation Models framework, Siri. So we get to contribute to everything we're building at Apple. Thanks. Hi, everybody. I'm Angelus. I'm on the MLX team. And what I'm most excited about, I mean, I've been saying this for the past few days to people who asked me: I'm really excited about local AI this year. I think we're basically at the point where local AI is starting to become useful. And it's not a gimmick, like a chat, like a small thing you can do in your CLI. You can actually do work with it. You can actually have local agents. You can have the FM chat on the app even to do other things on your devices. Basically, I think it's the year of local AI. So I'm really excited about that. And of course, I'm always excited about MLX, so no need to mention that. Thank you, thank you. All right, in addition to those you see on screen, there's a whole team of experts behind the scenes helping triage all your questions. So please keep them coming. We'll get to as many as we can today. For code-specific questions or anything we can't get to today, I encourage you to head over to our developer forums at developer.apple.com/forums and ask there. 
And if you have a bug report or a feature request, please file those using feedbackassistant.apple.com so that we can use the time today for questions that are more broadly applicable to everyone. All right. So with that, let's get started with our first question. Here's a question from Jane Chow. Could you explain the roles of Core AI, Core ML, and MLX in simple terms? From a beginner's perspective, how should one understand them and decide which one to learn or use? Okay, who wants to take this one? I can give the high level overview. Go for it. And I will make it broader, right? Like here is not mentioning, for example, the foundation model framework. And I think what we're trying to do at Apple is build this comprehensive suite of different technologies that basically the developers can approach at different levels. You're going to start right at the topmost with the foundation model framework. If you want to use and your use case is well served by an LLM that we provide or that you can plug in there, then that works great. Then if you will have something more custom, either a model that you have, you train, or you download it from, you know, one of the multiple repositories, you should try Core AI, right? Like it comes with a lot of SLAs and guarantees, particularly if you're building an application. And then obviously there is MLX, which is extremely successful and doing really powerful stuff, right? Like the demos that Angelus has shown about, you know, distributed with multiple machines, that's super exciting. And you can build another type of use cases with that. So that's what we're trying to do. Specifically with Core ML, I think moving forward, we're asking everybody that work with neural networks to use Core AI. Core ML stays there. But I think right now it's going to remain focused on just the traditional ML, like decision trees, that type of stuff. But everything new should be moving to Core AI. Thanks, Rasiel. 
Any foundation-specific nuance or MLX-specific nuance to contribute to that answer? Yeah, so specifically if you're doing things with an LLM, language model stuff, start with Foundation Models, try the system language model, see if that gets you there, use evaluations to show for sure whether it gets you there or not. If that doesn't work, you've got Private Cloud Compute. And then if you need something more custom, you want to bring your own model, you want to use an open source model, try plugging in Core AI through Foundation Models so that you have the same API. It's just like a one- or two-line switch, and you're off to the races. If you're doing things that are not language models, so you're doing diffusion models, you're doing image segmentation, drop down to Core AI. If you need it, drop down to MLX. Does that sound pretty clear? No, that's exactly that. I think it's pretty obvious. When you're doing language model stuff, you should take advantage of the language model protocol. It allows you to change and try things, local, remote, obviously, right? And I would say maybe the only thing that hasn't been mentioned, which could be kind of interesting for people, is that if you want to train, for instance, on device, that is a use case that's kind of way easier in MLX or unique to MLX this year. Yeah, and that would be a reason to use, for instance, the MLX adapter for Foundation Models. Yeah, exactly. Things like that. So yeah, definitely start with Foundation Models. Definitely use evaluations to know which of these models actually is the one that covers your use case. Because why go to PCC? Why go to a remote model if the on-device model already covers your use case? Yeah, exactly. So why bring something when it's already there? It's already there. Fixes it. Yeah, I think that's a good summary. I will mention that we have several sessions this year that cover all these topics. 
We have sessions on MLX that talk about how to connect four Mac Studios to run a trillion-parameter model, run local agentic loops. Not on the phone, guys. Not on the phone. And there are Core AI sessions on bringing models from PyTorch, also integrating into apps, and PCC and dynamic profiles. A lot of sessions, and maybe we'll get to some of these if you all have specific questions. So thank you. Thank you. So let's move on. Here's another question. This is from Abhi27. What is the on-device foundation model's context window in iOS 27? Is the input and output counted against one shared token budget? Sounds like a question for you, Eric. I can answer that one. So the context size is the same as it was before. It's 4096, and it is a shared window. So if you input 4000 tokens, your response can go up to the remaining 96. That's about what it boils down to. And PCC? Oh, and PCC is 32K. So it goes all the way up to 32K. Same thing. It's a shared budget. Okay, so from a developer's point of view, if they want a larger context window, we would recommend using PCC. And if they have a need for deeper reasoning capabilities and those sorts of things. Yeah, exactly. So you have more choice this year, although the window for on-device is the same. Yeah, exactly. And if you really want to go crazy with like huge context sizes, plug in something from MLX or Core AI or grab one of those server language model packages, go up to like a million tokens. I do want to mention the MLX team and the Core AI team have a package, a Swift package, that lets you use the Foundation Models framework by just switching the model. So you can bring in your own model asset, bundle it with your app, and then take advantage of whatever the token context window is for that model. It's why Eric is excited about the language model protocol. Yeah. And on that note, I'm going to steal this moment to make a plug. Both of those packages are open source. And so very soon, so will be the Foundation Models framework. 
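To make the shared token budget from the answer above concrete, here is a small Python sketch of the arithmetic. It is only an illustration, not framework API: the function names are made up for this example, and treating "32K" as 32 * 1024 tokens is an assumption, since the panel only says "32K".

```python
# Context sizes quoted by the panel. 32K is assumed to mean 32 * 1024 here.
CONTEXT_WINDOWS = {
    "on-device": 4096,
    "pcc": 32 * 1024,
}


def remaining_output_tokens(context_size, input_tokens):
    """Tokens left for the response when input and output share one window."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return max(context_size - input_tokens, 0)


def smallest_sufficient_model(input_tokens, min_output_tokens, windows=CONTEXT_WINDOWS):
    """Pick the model with the smallest window that still leaves enough room.

    Returns None when no window is large enough. This mirrors the advice to
    stay on device when the on-device window already covers the use case.
    """
    for name, size in sorted(windows.items(), key=lambda item: item[1]):
        if remaining_output_tokens(size, input_tokens) >= min_output_tokens:
            return name
    return None
```

With these numbers, `remaining_output_tokens(4096, 4000)` gives the 96 tokens mentioned in the answer, and a 4000-token prompt that needs a 500-token reply would be routed to the larger window.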
So there's a great story here around all the open source components plugging together and being able to learn from the source code. Should be really neat. That's a pretty good point, right? There are great examples for other people to just look at the code, see what we did there. Maybe they want to do something very special. They have really good examples there. Yeah, absolutely. We actually have a session this year that not only shows use these different packages, but also how to build your own. Yes. Bring your own LLM provider. So if you have a custom ML model framework, then you can conform to this protocol and bring your own. Yeah. And you can really cheat. So here's a tip for you. If you want to make your own language model and plug it into the framework, go into Xcode, open up your agentic coding. We have a skill that we've open sourced that will help your agent go in and implement the language model protocol. And you can also point it at both of the open source repos for MLX and Core AI and give it to your agent as a source of inspiration. And it should be able to do a pretty good job putting the pieces together for you. And, of course, you have to use the evaluation framework to decide which model is. We got you covered. All right, let's move on. Another good question. Thank you. Thank you, Abhi. Another question from Abhi27. This is, can foundation models run inside background app refresh task or background processing task, especially while the phone is locked, asleep, or the app has been backgrounded for a while? Ooh, that's a really good question. Good nuanced question here. Yeah, that's a good one. In the background, yes, it can run. So you can do it in like a background task. There is a possibility, particularly if the OS is busy doing other things, it may rate limit you. So you may catch a rate limited error from the system language model that tells you you're done for a little bit. So just wait and try again in the future. 
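The "wait and try again" advice can be sketched as a retry loop with backoff. This is a language-neutral illustration written in Python, not the framework's Swift API: `RateLimitedError` and the `generate` callable are hypothetical stand-ins for whatever error and call your own code uses.

```python
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for the rate-limit error described above."""


def generate_with_retry(generate, prompt, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call generate(prompt), waiting and retrying when it is rate limited.

    `generate` is any callable supplied by the caller; it is an assumption of
    this sketch and not part of any Apple framework.
    """
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the loop easy to test, and in a background task you would likely cap the total wait rather than retry indefinitely.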
On macOS, you should be good as long as you're in the foreground. And with the Private Cloud Compute language model, you may hit rate limiting for other reasons. There's rate limiting that happens because the system's busy. And then there's rate limiting that can happen because you've sent too many requests in a very short period. And those manifest as different kinds of errors in the API. So you've got to tell them apart. So a developer will be able to catch them and appropriately show that kind of message. Just because I didn't get it. So on macOS, are you also rate limited, or are you never rate limited with the local foundation model? With the local foundation model, you won't get limited as long as your app's in the foreground. Okay. Do I have to worry about quality of output if I'm rate limited? Or is it just a matter of time? It's just a matter of if it comes out or not. You'll get the same response in the foreground and the background. It just might take a little longer in the background. Yeah. Sounds good. Another question from Dessa. For macOS 27, Apple Intelligence, what does waitlist mean? Both the local and PCC models are working. Are we getting different models on the waitlist? Does this beta include AFM core advanced 20 billion model? Thanks. Oh, that's like they snuck in two questions within one. So who's going to take that one? Everybody's looking at you. We're the closest to Apple Intelligence, but yeah. So on the record, I don't work directly on Siri, so I'm not fully qualified to answer this, but I do know the answer. Then I should plug in. We do have a group lab for Apple Intelligence, I believe, later this week. So please bring this question again. I'll tell you the answer. Spoiler alert. So the waitlist applies only to Siri. It doesn't apply to the Private Cloud Compute language model or to any of the things that Siri does on device. Okay. You knew the answer very well. I do. We just wanted to plug the other group. 
And as a bonus, I'll answer the second question, too. The second question was, does the beta include AFM core advanced? The answer is yes. It's used for the voice features and stuff. Cool. Great. This question is from IndigoJ. The Foundation Models Framework now supports bringing your own LLM provider alongside the on-device model and Private Cloud Compute. Can you mix all three within a single agentic flow? And what are the data privacy and attribution boundaries once a third-party provider is in the loop? Ooh. This is such a good question. I'm really glad that you asked this one. So thank you, IndigoJ, for that one. You should start by plugging your session. Yes. So there's a really great session. The guy who delivers it is awesome. It's... Oh, I just forgot the name, but help me out, Shashank. I think Eric... Building agentic experiences with foundation models. Yes. Building agentic experiences with foundation models. Give that one a watch. It has the answer to the question, but I'll recap it here. We have a new API called Dynamic Profiles that allows you to route to different models using a declarative API. On Apple platforms, we have the system language model. We have Private Cloud Compute. They're appropriate for different kinds of things. And there are great reasons to start doing something with the system language model, but then switch to the Private Cloud Compute language model when things get more complicated. And you may even want to pull out a third-party language model like you just suggested. We try to make those boundaries extremely clear in the API. It's declarative. And that declarative nature makes it really easy to reason about which model is going to wind up doing each task within that workflow that you're talking about. 
And so in the session that I encouraged you to watch, you'll see that we talk a lot about designing the boundaries between the models, designing your handoffs based on things like performance, privacy, and then if you're using a third-party model, things like cost come into play as well. So this may be a better answer by watching the session, but does this mean that the whole context is going to be sent to each model? Oh, that's a really good question. Because that would be the main privacy concern, right? Yeah. So we talk about two patterns. There's obviously lots of patterns you can use to build agents, but the two that we call out in the session are baton pass and phone a friend. And baton pass is... It's not very private. It does not sound very private. It is not. Your hand off right there. Yeah, so baton pass, every agent or every dynamic profile is involved in the same race. Like when you're running a relay race, you know the person before you. You see them coming. You know everything that they're doing. When they hand off the baton to you, you have full context on everything that's happening. And then you take that baton and you run across the finish line. Like once you have the baton, you don't give it back. You're crossing the finish line. You're in charge now. And so that is the way that you want to orchestrate things. If you intend the full context to be shared, you know there's nothing in there that's privacy sensitive. Or if you're only using the on-device model, you're only using private cloud compute, you don't have to worry about it. And so that's a great way to go. If you want to establish those privacy boundaries, you know that you're handing off to like an untrusted model or just one that has a weaker privacy guarantee, then you want to use Phone a Friend. Phone a Friend is like when you're on Who Wants to Be a Millionaire? And you get a question from Regis, and the question's like, hey, what's the name of that statue in New York? 
So I get on the phone and I call up Shashank. I'm like, Shashank, I can't remember the name of that statue in New York. It's a big green lady. Shashank has no view into the previous questions that I've been asked. He just knows what I'm asking him right now. And so that preserves the privacy of the beginning of the game show, right? A key part of that is once Shashank says, oh, that's the Statue of Liberty. It's back to you. It's back to me. And I have to give the final answer. So there's a little bit of extra repetition in there from time to time. Would it be okay to say it's like tool calling basically? Yeah, it is. You're tool calling a bigger model or a different model. It could be smaller. Tool call with an ephemeral session. Yeah. And then you get back the answer and you repeat it. So there's great privacy benefits to doing that. It also gives you an extra context window. So if you need a little bit more size, you can get it that way. Yeah. Like a whole bunch of stuff. Yeah, yeah, yeah. You're giving the cheat code. How do you manage that between models that have different context windows? Yeah. So another great question. You'll see in the Foundation Models Framework this year, we've added this notion of profile modifiers. And so you can hook in on things like handoffs, and you can also do it just purely in a declarative way that says, for example, right, I want to save on context, so I'm just going to keep the last 10 entries in the transcript. Or as soon as a tool call output has been used to produce a response, we're just going to drop that tool call. It doesn't need to be in the transcript anymore. Yeah, and these can be done in functional ways. So it's a stateless transform. So if you go to the on-device model, you're just looking at the last 10 entries in the transcript. But if you bounce back to Private Cloud Compute, you've got a lot more context size to work with. And so everything before that suffix comes back into play.
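The two trimming ideas in this answer, keeping only the last few entries and dropping tool outputs that have already been used, can be sketched as pure functions over a transcript. This is a Python illustration under assumptions, not the profile modifier API itself: the entry shape (a dict with a `kind` of `"prompt"`, `"tool_output"`, or `"response"`) and both function names are hypothetical.

```python
def drop_used_tool_outputs(transcript):
    """Return a copy without tool outputs that precede the latest response.

    A tool output that comes before a response is treated as already used.
    The input list is never modified.
    """
    last_response = max(
        (i for i, entry in enumerate(transcript) if entry["kind"] == "response"),
        default=-1,
    )
    return [
        entry
        for i, entry in enumerate(transcript)
        if not (entry["kind"] == "tool_output" and i < last_response)
    ]


def view_for_model(transcript, max_entries=None):
    """Build the view one model sees; the full transcript stays intact.

    max_entries=None models a large-context backend that can see everything,
    while a small number models a tighter window such as "the last 10".
    """
    trimmed = drop_used_tool_outputs(transcript)
    if max_entries is None:
        return trimmed
    return trimmed[-max_entries:] if max_entries > 0 else []
```

Because the functions return new lists, the same stored transcript can be viewed through a 10-entry window for a small model and in full for a larger one, which is the "stateless transform" idea described above.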
And you have access to the full context. And so there's really flexible ways to go about doing this. It's a view into the transcript, but just the last few. Yeah, yeah, yeah. Makes sense. Wow, that was good. I think a nuanced discussion. Yeah. All right. Next question is from Claire Casey. I'm a new developer. My app uses on-device speech-to-text and must recognize names, proper nouns that general models miss. Does iOS automatically personalize recognition to each user, learning their words and pronunciations over time, or do I need to build and maintain that list myself? Who's qualified to answer that one? This refers to a few different possible frameworks across speech and I don't know. Does anyone? I think we might not have the expertise for that one on this panel. At a higher level, it feels like there are two sort of subtopics here. One is the ability to do all this online learning, which I think is what Siri might do at an OS level. The other is what you do within the app, right? If there's things that you do inside the app, then that is on the developer to manage. I know, I used to do research on speech recognition. So in general, when you have a speech recognizer, you typically also have another component that does your personalization. Like, for example, you have a smaller language model that does personalization. So that's something that you can fine-tune just based on the examples. This is very common when you have your contact list, right? For example, my name is really hard for any automated language model to know how to pronounce, Raziel, right? So that's something that gets learned. So that will be the component. Whether our API supports that or not, I just don't know categorically. I will assume that we do. Yeah, yeah, we don't know, right? No one here represents those frameworks. But just to sort of close this out, by bringing your own custom model, is it possible to build a system like this? Definitely, you can build it on your own. 
Yeah, the question is whether you have to build it. Do you want to do it or is there a native option? Okay. I would also think that unless it's one of the languages that's not supported, that maybe is the reason. But if it's one of the languages that's already supported by the speech recognition on our devices, I would just use what is supported. I think it's going to be already pretty good, right? I would also recommend just asking this question on the developer forum. We've got engineers watching that like a hawk all week long. This week. So you've got a very good chance of getting the answer from the person who really knows it. Exactly. Sounds good. Next question from Pichaya underscore T-R-Y-Y-S. How do you train coding agents to know more about my code style and specific area? I use a local LLM, Qwen, in LM Studio with Xcode/VS Code Continue, to learn my complex code base, which contains visionOS, Metal, physics simulation, and a macro that generates complex 3D resources, but it does not perform very well. Ooh. I'll take this one. This is a good one. Yeah. I think there's a general principle here that I like, which is how do agents learn anything? And I think one of the things we like to talk a lot about is search and learning. The more tools that you give agents to search and find things, the more they can learn. And you also can help teach them how to learn by even just writing down and documenting what they do. Agents learn a lot like how we learn, right? Like if I was trying to teach you something, taking notes is a great way to remember something so you don't have to go search everything over again to learn the same topic. So when it comes to coding style and the kinds of, I don't know, the different APIs you like to use or just even the syntax that you like to have, first, the agents are going to be really good just out of the box by seeing source code that is already in your project and copying that. 
And they'll try to copy that as closely as they can, just from what's already there. The other thing that you can do is you can provide the ability for the agent to always have some examples every single time it runs a query. So different agents have different files that can get automatically included, like an AGENTS.md file or a CLAUDE.md. That file, though, is in every single query that you run inside of Xcode. So that context window will get eaten up by that file every time. So you want to keep it really short. You wouldn't want to put, like, your entire style guide in there every single time; it's going to be eating up a lot of tokens. So then what you can do is you can reference other files inside of your AGENTS.md file. So you can say, oh, there's a style guide. And here's where it is on disk. Or whenever you're working in this area of the code base, go check out this other markdown file. And so the agent will see that in your AGENTS.md. And then it will say, oh, actually, I need to go find some information about what it means to work in this area. Go off and find that particular markdown file, et cetera. So it's really just about search. Go find markdown files and then learning. And as it goes along, you can also tell it things like document all the things that you're seeing in these markdown files. So if it's seeing like your network layer for the first time, tell it to like write down all the assumptions that it sees as it goes along. And then you have that. You can correct it. And you can also keep it up to date. And then you can point it at it and say like, hey, every time you do networking, go look in this networking thing first to kind of learn how to do those tasks. Oh, that's a great tip. Like I find sometimes when I create a new conversation, it doesn't necessarily remember what I had done in another project. And if you have a file like that, you can kind of get it to build up consistent behavior across everything that you're working on. Yeah. I like that. 
Yeah, many files. Yeah, yeah. And you can set it globally or you can do it per project, right? So I feel like the onus is on the developer a little bit to continuously teach these agents what is my style and what is preferred for this particular project. For my coding style, what does my team prefer? And the more that you can do that, the better that you're going to get that agent to conform to the style that you like. Yeah. Yeah, like Eric was saying, for building a new provider for the Foundation Models framework, you can have a skill, then you point to examples, like what we have for MLX and Core AI, and then it learns how to do that. Now, one thing I'll add, which is it can be overwhelming to think about, oh, AGENTS.md and all these documents and skills. Like, I have to, like, build this big, you know, corpus of them. Sometimes that can be helpful. But whenever a new model comes out, our recommendation is try not using any of it. Try starting over. And so actually developing the ability to, or I don't want to overuse the word skill, developing the skill to learn which skills you actually need is a good skill to have. Because then as new models come out, as they train new things that run locally or new frontier models, you can kind of have a habit of checking, oh, do I actually need all that information? Because you might not need it because the model has gotten advanced enough on its own. Yeah, they learn new APIs as well. Especially if you're using a local model that is a bit older, for instance, it may not know the new APIs, right? And then it learns them. So no need to document them explicitly anymore. Now, speaking of Apple APIs... That's actually a great point. Thank you for bringing that up. Xcode has really extensive documentation that we search on. So you can say things like search documentation for information about this API. 
So even if you're using a model that was trained a while ago that hasn't seen our new APIs, you can bring all the new APIs from the new version of iOS and macOS, et cetera, right into Xcode. And it will be able to see all those by doing tool calls to search for those APIs and learn what's new to learn how to use them and pull them into your project. Oh, that's really useful. Especially during betas like this. People underestimate how these things actually make... I like to say, anybody that is doing local models where you can actually try old models and see that with the new tools, they're actually suddenly much better. A huge improvement that has happened lately is not actually in the models, but it is in all these things, all the tools that we give them now that we didn't use to give them like two years ago, right? Yeah, the rigging around them. Yes, exactly, all the things around them. So basically, if you take a model that is even a few years old and you put it in the new Xcode. New harness. Yeah, it's going to be so much better than even last year's Xcode experience. So that's pretty good. Yeah, you get grounding, you get running the code, seeing if it's working, debugging, then going back into an agentic loop and fixing the things, running it again. It's incredible. And you can even use a simulator. Yeah, yeah. And then get it to remember all the things that it tried. So I think the question mentioned some crashes. As you see crashes, have it document what it did to solve it. The next time it comes through, it can use it as almost a memory to remember, like, oh, I should avoid that particular pattern, and the agent gets it right the first time more often. There's a small nuance in the question where they say, I use a local LLM, Qwen, in LM Studio with Xcode, right? And then they mentioned that the code base has all these Metal and multiple things. 
So I think depending on what model Xcode is hooked up to, the ability of the model to have a larger context window or smaller, if it's Qwen running locally on your MacBook, the context window is smaller. So it may not be able to replicate the style of the entire code base; it may not be able to read everything. I would say from personal experience, obviously we don't know the specific model, we don't know the code base. I think it would be pretty good, especially on style replication and things like that, even smaller models that you can run on your laptop. I do assume that they're running with MLX, just because. But... Fair. Yeah, yeah, exactly. But yeah, the difficulty will be in fixing more complicated things that require very deep reasoning. If you're using a smaller model, then it's going to be obviously not as good as... But the tools and all the goodies that come with Xcode are going to be usable from a smaller model as well. We have an example in our session where we're actually using Xcode with a Qwen 35B. And it goes through the normal tools. So are you using... this is actually a good point about the question. Are you using the chat? I am using the chat provider, but I know I should be using the... But it's new. So something that's new in Xcode 27, because we've added ACP support. So you can now actually plug in agents that would talk to models hosted locally with LM Studio or Ollama or the other local providers. That's something that's new with Xcode 27 and definitely something for you to try out. Yeah, but even, and you should, people should try this out, but even with the chat, it would still work well. Still works, but agents are great. Way better, way better. I'm not saying, but the point is, if they do ask the question, can I use a local model with Xcode? Yes. The answer is clearly yes. You should use ACP, but you could also do it any other way. It still should work. So on that topic, what is the advantage of ACP? 
What does that get you over just the chat completions? Well, the agent client protocol lets you talk to the agent of your choice. So the difference with the chat, it's just a single transcript. Whereas with agents, agents can have sub-agents. They can manage their state. They can do file I.O. The agentic harnesses can go in those loops much, much, much longer than a single conversation. Awesome. So answer to the question is yes, it should work. You should use it. And the crash is a mystery. Unless we have more information, we don't know why it's crashing. So developer forums is a good place to continue this conversation. Thank you. So a question from Brian KM. With regards to UI testing, what practical steps can teams take to integrate automated approaches into their testing workflows on Apple platforms? Should we assume testing here is testing the output of the model? That's a good question. It doesn't vary. It sounds like UI testing. Yeah, it sounds like UI testing, but perhaps using AI tools? Yeah. I like to think about it in a couple of layers. I like to think of first is getting the agent to think in the smallest kind of kernels, right? These are like your unit tests. One of the things that we try to do is build systems in a way that they can be tested independently. And then you get the agent to enumerate all the different possible cases. So your unit tests are really like the place that you think about all the permutations and combinations. And we often don't hook them up to like real backend systems. It's all something that we can run super fast. Then the next layer is we think of, well, let's start bringing in some of those dependencies, and let's have them run a little bit longer, a little bit more expensive. And again, agents are great at helping you kind of structure your modules and your code base to be able to be more testable. 
Something that I often would do when I was writing code by hand was not think about those boundaries as much, and I'd kind of get stuck in the spot where I have to write UI tests for everything. Those are just really expensive. But agents let me be like, actually, you know what? Try to tease this apart and make it a lot more testable and kind of bring it off into a submodule here, bring it into a submodule here, et cetera. And so I kind of view the UI testing as the last step where you just check to make sure, okay, once I get the UI up and running, do those couple of connection points all work out? So I think of thousands and thousands of unit tests, you know, and then these middle, more expensive integration tests. You kind of have a couple hundred of those, and then you have just a handful of UI tests. And then new with Xcode 27 is that now we have interaction with the simulator. So you can have the agent actually test that, learn the patterns, and then have it write UI tests so it can do that repeatedly. And you don't have to use the agent every time. Interesting. So the agent can use the simulator to tap and do stuff? Yeah. Wow. That's pretty cool. That's new in Xcode 27. And look at the UI? Like take a screenshot or whatever. Yeah. So the agent can tap. It can swipe. It can type. And then it actually surfaces back the accessibility tree and screenshots. And it can use both to figure out what's on screen and tap through it. So sometimes we'll just let the agent run for a couple of hours and look for different bugs and things and give reports of which areas need the most attention. And then we can write UI tests or even figure out, oh, actually, we should make that a unit test because that system is just really flaky. Does that mean you could just give a bunch of screenshots of the finalized UI and say, go build this, and then it should be able to instantly recreate that? Yeah.
Give it a shot. See how it goes. V-Diz, V-Diz. Yeah. Sounds good. Another question here. Have there been any updates to the natural language processing and Apple Vision Kit? Now that the foundation model supports image attachment, what should be the preferred method of image extraction? A few different frameworks mentioned here. Natural language framework, vision framework, I assume, and foundation models having support for image inputs. Yes. Maybe we could start there. Yeah, there's all kinds of stuff in there. I don't know as much about the natural language framework, so I can't speak to that. The vision kit, or the vision framework, has lots of updates this year. There's some really cool stuff in that. I'd recommend the What's New in Vision session. Image understanding. Yeah, image understanding. Apologies. What's New in Image Understanding. They cover all of it there. There are things like segmentation models, which I think are just really, really cool. I think part of what this question is getting at is when should I use foundation models to understand something about an image versus the vision models. Vision framework, yeah. Yeah. And the line is a little bit blurry, particularly because we're introducing new tools for foundation models that are powered by vision. So we've got a barcode reader tool and we've got an OCR tool that we're adding into the framework this year. And so the way that I think about trying to delineate it is, if you're doing something that is pretty well understood and roughly the same every time, like you're trying to detect a particular image or a particular object in an image, go for vision because it's optimized for that. It's a well-understood problem. You can test it really well. It's extremely efficient.
If you're doing something completely different every time or if it requires semantic understanding or if you're taking a prompt from a user that has some kind of natural language nuance, that's really the domain of foundation models and natural language understanding mixed with, of course, those new vision capabilities. And so then you're going to need to step up to the foundation models framework and tackle it that way. I hope that delineates it. Yeah, so basically, if there is something structured for which there is already an API that fits that structure, use that API. Yeah, definitely. Now the foundation model framework is-- Yeah. I talk about it sometimes like foundation models are kind of like a 3D printer. You can do all kinds of stuff with them. And if you have to make lots of custom orders, man, foundation models, 3D printers, they get the job done. But if you're going into production, you've designed your thing, you know what it's going to be, sometimes a baked model that does one thing really well, like a production line that's just stamping out copies of your product, is a little more efficient than a 3D printer. I think the other example I was giving someone else today was the difference between using the foundation models framework for translation versus using the translate API. Because there is a subset of languages the foundation models framework supports for translation, whereas the translation framework has a higher number of languages. They are specialized. You know what the input language is. You know what the output language is. It's much easier to use that versus... The difference is, again, in the nuance, because you could say, translate it as if it was 1920. Yeah. You want to change the style. Yeah, there's stuff you really cannot do with... Or if you're translating somebody that's speaking two languages at the same time. Yeah, exactly.
I mean, this probably could be done with classic models as well, but the other thing I don't think any classic model would try to support, right? Yeah, exactly. Yeah, so it boils down to the use case, right? If you're just translating something every day in the app or in the background, then use the translate API. But if it is dynamic, you don't know what is coming, there are different styles or something different, then a language model would be a better fit. Yeah, good discussion. Next question is from John Lee. On-device LLMs have relatively limited token capacity. What are the best practices for managing prompt size, tool definitions, and context to avoid exceeding limits while still maintaining high-quality responses? They don't mention the Foundation Models framework. So I assume we can start there. We can also talk about the best practices if you're shipping your own model in your app and how you manage that. Yeah, maybe start with Foundation. So I can talk a little bit about the APIs that we've added this year to help you do things like this, and then we can all just talk generally about how those should be used and what kind of general practices work best. So in the Foundation Models API this year, we've added a couple of things. One of them, we actually added a little bit before. So in iOS 26.4, we added new symbols into the Foundation Models Framework for context size on the system language model, and then also counting tokens. And those are really useful to know, first, programmatically, how much context size do I have to work with? And then given a prompt, how much is this going to take up? We've complemented that this year. Now when you get back a response from the framework, you can access response.usage. And that tells you exactly how many tokens were on the input, how many tokens were on the output. It tells you how many were cached.
And if you're using a reasoning model, the output tells you how many of those output tokens were reasoning versus just the rest of the output. By the way, I just want to say that was a highly requested feature. Yes. So thank you to all of you for surfacing it to us so we could, you know. Yeah, exactly. That was one that you guys asked for, and we were like, we can make that happen. So if you use those now, you should be able to track your usage very accurately and understand how much context size you have left. And then you can start employing different strategies to manage that. And so, for example, one of the things that you might want to do, particularly when you're working with the on-device language model, which has a smaller context size, comes after you're done with a tool. So the model has invoked the tool, the tool has generated output, and the model has produced a response from that tool call. You may want to go back and drop the tool call and the tool output because often once you have the answer, that's really all you need. You don't need how you got there anymore. And so if your use case is one of those, that's one thing that you can do. The other thing that you can do, we talked about this a little bit earlier, is you can start to drop older entries, the ones that seem to be less relevant, or you can compress them. So we open sourced a new repo, Foundation Models Utilities, and the idea with that repo is to periodically release high-level components or building blocks that you can use with the Foundation Models Framework or that you can leverage when you're using CoreAI or MLX or any of these other models in a very general-purpose way to build higher-order functionality. And one of the things we put in there is a Summarize History modifier. And so at the beginning of every prompt, we'll check how big your transcript has gotten. And if it's exceeded a certain size that you can specify, then we will summarize the whole thing down to one entry.
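The two strategies just described, dropping finished tool calls and collapsing an oversized transcript into one summary entry, can be sketched in a framework-agnostic way. This is a minimal illustration only, not the Foundation Models or Foundation Models Utilities API: the `Entry` type, the whitespace-based `count_tokens`, and the `summarize` callback (a stand-in for a model call with an overridable prompt) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entry:
    # Hypothetical transcript entry; kind is one of
    # "prompt", "tool_call", "tool_output", "response", "summary".
    kind: str
    text: str


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def total_tokens(transcript: List[Entry]) -> int:
    return sum(count_tokens(e.text) for e in transcript)


def drop_tool_entries(transcript: List[Entry]) -> List[Entry]:
    # Once the model has answered, the tool call and its raw output
    # are often no longer needed, so remove them.
    return [e for e in transcript if e.kind not in ("tool_call", "tool_output")]


def summarize_if_over(
    transcript: List[Entry],
    limit: int,
    summarize: Callable[[str], str],
) -> List[Entry]:
    # Leave the transcript alone while it fits the budget; otherwise
    # collapse the whole thing into a single summary entry.
    if total_tokens(transcript) <= limit:
        return transcript
    joined = "\n".join(f"{e.kind}: {e.text}" for e in transcript)
    return [Entry("summary", summarize(joined))]


def toy_summarize(text: str) -> str:
    # Placeholder for a real model call: keep only the first five words.
    return " ".join(text.split()[:5])


transcript = [
    Entry("prompt", "What is the weather in Paris today"),
    Entry("tool_call", "get_weather city=Paris"),
    Entry("tool_output", "temp=18C sky=cloudy wind=12kmh humidity=70"),
    Entry("response", "It is 18C and cloudy in Paris"),
]
after_drop = drop_tool_entries(transcript)
compacted = summarize_if_over(after_drop, limit=10, summarize=toy_summarize)
```

Passing the summarizer in as a parameter keeps the summary prompt swappable, so a verbose or a concise summary can be tried without changing the compaction logic.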
And we'll start from there. And so you can kind of periodically compress once you get over a certain boundary. And we've made that really easy to do. So we hope that will help a lot, particularly with the on-device model, or when you're moving between a really big model and one that's smaller. Like if you have half the context size with model A that you do with model B, then those tools can be super useful for managing that as you bounce between them. A question on the utilities you mentioned, the open source utilities. Yeah. Would those be applicable if you were using other model providers? Oh, it applies to any backend? Yes, it applies to any backend. This is one of the coolest things about that language model protocol. That's why I'm so jazzed about it. If you conform to that protocol. Everything. You get everything. Everything that builds on foundation models. Nothing special to the. Yeah. Okay. Exactly. So like. That is pretty cool. The dynamic profiles API is really cool. If you haven't had a chance to look at it, go watch the sessions on it. It's like nothing else that's out there. And you get that for free if you adopt the language model protocol. And all the stuff that we're putting in this foundation models utility framework, you get that as well. And so it's really worth doing that work up front. How do you think developers should think about the tradeoff between basically invalidating the KV cache when they're dropping tool calls or tool results versus not doing that? Hotly debated, yeah. Right? Like, what's the best approach? Yeah. So this is where evaluations comes in. You teed yourself up. I see what you did there. So, like, there's a debate, right? You can let the context fill up as far as you can because you're not invalidating the KV cache, and so you're keeping your latency low. And then you can eat one big invalidation every now and then where it's a little bit more expensive when you drop stuff or when you reset.
Or you can try to do these adjustments, particularly dropping tool calls. You just do that after every response. And then the invalidation is a little bit smaller. It happens more often, but is it as big a deal if it's small? And then the other thing you have to balance with that is you're thinking about strictly the performance from a latency perspective. There's also the performance from an accuracy perspective. So if I let the model fill up its full context size, maybe I'm doing well performance-wise, but maybe there's a lot of stuff in there that's no longer relevant, and maybe the model gets a little bit distracted by it. And so to answer those questions, like how often do I summarize, you want to start using the evaluations framework. You can see, for your use case, with the model that you're using, does it do well when the transcript gets really long? Does it do better after I've summarized? Right. Does it want more context? Does it want less context? Right? And so you can set up A/B testing effectively with the same evaluation data set. You can see how well it does for each of those different configurations. So set up all of your different dynamic profiles that have those different strategies employed, and then just run them all. And there's this really great feature, which is compare. So you can just literally directly compare the results for all of those different configurations across the same data set and see how well it does. And so it's an incredibly powerful tool. I'm really excited about evaluations. We receive so many questions from developers like, can the model do A or can the model do B? And what it comes down to is your use case is unique. And we don't know the answer for you. But now we have the tools for you to go get the answer and be very confident in it. And that's so important when we're doing, you know, we're moving into this world where there's a lot more non-deterministic behavior.
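The A/B idea above, running several configurations over the same evaluation data set and comparing the scores, can be sketched as follows. This is a toy illustration under stated assumptions, not the evaluations framework API: each configuration is modeled as a plain function from prompt to response, and `score` is a simple substring check standing in for a real evaluator such as a model judge.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical data set shape: (prompt, expected answer) pairs.
Dataset = List[Tuple[str, str]]
Config = Callable[[str], str]


def score(response: str, expected: str) -> float:
    # Stand-in evaluator (assumption): 1.0 if the expected answer
    # appears in the response, ignoring case; otherwise 0.0.
    return 1.0 if expected.lower() in response.lower() else 0.0


def evaluate(config: Config, dataset: Dataset) -> float:
    # Mean score of one configuration over the whole data set.
    if not dataset:
        return 0.0
    return sum(score(config(p), e) for p, e in dataset) / len(dataset)


def compare(configs: Dict[str, Config], dataset: Dataset) -> Dict[str, float]:
    # Every configuration sees exactly the same data set,
    # so the resulting scores are directly comparable.
    return {name: evaluate(fn, dataset) for name, fn in configs.items()}


dataset: Dataset = [
    ("capital of France?", "Paris"),
    ("capital of Japan?", "Tokyo"),
]

# Toy stand-ins for two context strategies: one keeps the full transcript,
# one has summarized away a fact it later needs.
full_context = {"capital of France?": "Paris", "capital of Japan?": "Tokyo"}
configs: Dict[str, Config] = {
    "full_context": lambda p: full_context.get(p, ""),
    "summarized": lambda p: "Paris" if "France" in p else "unknown",
}

results = compare(configs, dataset)
```

In practice the interesting part is the evaluator and the data set, not the loop; the sketch only shows why holding the data set fixed makes the configurations comparable.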
It also means when you reach for a certain model or a certain configuration, you're confident that that is the best one for that job, right? You can test all the different options. So everyone's asking, oh, should I use PCC or should I use the on-device model for this? Again, evaluations is a really great way to determine that. When you're summarizing, can you change the instructions to summarize? Because it's probably like you're using the language model to summarize, right? Yes. So there's some instruction, summarize this in this way, or summarize your context in this way? Yeah, exactly. So the principle that we follow for everything that we put into the API is that if there is a prompt involved, that prompt has to be overridable. So there's a default one baked into the utilities. Try it out, see if it works for you. Yeah, exactly. Because I wanted to say that even the thing you were saying before, it's not just everything or one time a little thing. It's configurable. You can say, make a summary that is verbose, right? Or make a very concise summary. So then you don't have to do that. And that's another thing to hill climb with. You can edit everything. This is the main point here. It's using a model. It is a very general model. All of them are. You have to actually tune it for your use case. And you have to measure your use case. And that's it. So just to close out this, any last quick best practices from your experience on context window management, MLX, Core AI? I think what Eric mentioned about knowing, and again, evaluations are critical, right? Like knowing how much to remove, how much to condense, what to condense, because, right, you might condense to say, oh, I want to keep this many tokens, but maybe you want to remove user tool calls, or maybe there is something else that you want to remove. And a lot of this is very use case dependent.
So it's really cool that you can plug into the existing tools that they are building. And for example, when we were building the demo that we showed earlier today in some other talk, we had the option to build just the LLM directly through Core AI. But we were like, well, we can use the foundation model framework directly, bring in the custom LLM there, and we got a lot of stuff for free. So let's just do that. Yeah. Yeah. So the only thing I could say is you can look at that. But then again, maybe it's not for the phone or anything. But in the more general question of what do we do with the context, right? So we can look to the ML community and the newer models. How do they deal with it, right? And there are a lot of new attention mechanisms that are way better behaved in terms of context. It's kind of a nuance. I don't know if people are interested in that part. Like, oh, my attention, or it's a sliding window attention, so the context doesn't grow. Or it's a linear attention, so the context is fixed from the beginning. These are fundamental architectural changes, and new models have a lot of them. Like most models are now using a variation of all of these. So you do get better behavior with context. That's one option. And then, obviously, there's other problems that come with-- it's a reasoning model. So reasoning is the first thing that you drop. Yeah, yeah, yeah. That's a great point. Yeah, so reasoning is the first thing that you drop. And then there's other models also that try not to reason as much, because dropping, even if it is reasoning, still costs you latency when you want to process something simple. So new models also, if people are interested, again, in general, new models also have configurable reasoning. And they may also choose to reason or not to reason, which is, again, another thing that people do, so that they don't have to pay the cost of latency, and they can do better handling of longer context.
Because if you don't reason as much, you have more context for useful tokens. Like PCC supports low, medium, high options. Light, medium, deep. But like Angel was saying, there are some models where you fake a large context, but they come with caveats. So I mean, there are so many different options now. One thing I'll mention before we move on is a naive way to manage the context window, which we've heard from app developers or we've suggested to app developers: don't let a single session do too many things. Just break it apart into multiple things, right? Don't give it a kind of, say, do this and this and this and this. That's just a large-- if they're all independent tasks, there could be three separate tasks, and each of them has the whole context window for itself versus... Also, using the same model as a tool, as we were saying. Yeah, where you get the extra. Awesome. Good discussion. Thank you. We have about 10 more minutes, so let's do a few rapid fire, to the extent that we can get to a lot of good questions here. So this is from Evo. Foundation models guardrails sometimes refuse emotionally intense but legitimate journal entries: grief, venting. Can I prevent refusals on first-person emotional writing, and how do I detect guardrail refusal versus other errors to fall back gracefully? That's a good question. So the answer here is both yes and no. It's a little bit nuanced. If you are taking a journal entry as an input to the model, there is a setting that you can choose with the system language model when you're initializing the guardrails. I believe it's called... Permissive content transformations. Yeah, you got it. You're on the ball. Permissive content transformations. And so if you turn that on, the model will not error out on the input. So you can have some very emotionally charged input.
But the model may still refuse in natural language to elaborate on it or to expand on it or to say something that's in the same style as the input. If you're using structured output, so guided generation, then the model can throw a refusal error. The refusal error is different from a guardrail error. The refusal error is the model saying, I'm not going to answer this. It's the model's response. Right. It's the model's response. It's not related to the guardrail. It's just the alignment training of the model. It produces a response for you that says, I'm sorry, I can't help with this. The guardrail error is a separate kind of error. You can catch that one separately. And that comes from a separate moderation model that looks at the input, looks at the output, and flags it if it's problematic. Sounds good. And if you were to bring your own model, then these things don't apply. That's correct. That only applies to Apple's models. Right. So for this developer, maybe if some of these things don't work, if it's not behaving as you want, then please file a feedback. Yes. But if it is behaving appropriately but not to your needs, what you want in your use case, there are options to maybe ship a Qwen model or something else. We've got all kinds of approaches that you can take. I do want to plug, though. We have worked really hard on the guardrails this year. That's been one of the other big things we've got feedback on. It shouldn't be a problem, but yeah. It will hopefully be much better this year. We've got lots of data sets on it. We've trained on it very intensively, and the number of false positives should be way down. So give it a shot with the newest version of the system language model. Send us feedback if it's not to your liking and explore all kinds of options. Thank you. Another question here. Apple has historically brought a distinct perspective to areas like design and privacy. I am curious to learn more about your guiding philosophy or approach to AI evaluation. Amazing.
What? What an incredible question. So I know we don't have a ton of time. I would easily take the rest of the time. But I just want to say that I think the way that we think about evaluation is really not build your AI feature and then at some point down the road evaluate it. It's really that you want to start with evaluation because evaluation is really the living specification of your feature itself. It is all of the things that you think it should be able to do well. It should include some headroom for you to grow into it and actually perform even better than what you're doing right now. It should include edge cases. And really, by doing that, you kind of adopt this mindset of an evaluation-driven development lifecycle. In education, I actually used to teach for a while, we had this concept of formative assessment, which is an assessment for learning, right? It's not what did you learn already, it's how do you learn more, right? The quizzes themselves give you information back so that you can continue to find your weak spots, and that's what evaluation is. And so I think we've really tried to bring this to developers in a real way through our documentation and through this framework, in that we want to make it as easy as possible for you to start from the beginning thinking about what are all the test cases that will allow this to be a great experience for our developers. And so we have tools for even starting with a really small curated data set of your core use cases and expanding that. We have some synthetic capabilities for expanding your data set and then running that data set, seeing where the model or your configuration does really well and where it falls down, and then being able to go back and make those changes and compare them and really just hill climb on that feature.
So I think, you know, there's a lot more that could be said there just in general, but I think just putting it at the forefront of this type of development and making it something that is easy to do is kind of what we've, I think, uniquely been able to accomplish here. Yeah, I must say the best AI products that I've been involved in have been driven this way. Exactly. Like evaluation is really the bread and butter of AI. Yeah, and it comes from a lot of experience, you know, having worked on evaluations for our own products, right? Yeah. And so we really have taken all of that different experience and tried to build it into the fabric of this framework itself so that developers can adopt that same mindset. This is even very fundamental in a lot of engineering disciplines, right? Verification and validation as part of design and development. Yeah, it's just harder to do with language models, which is why you need a framework. Great discussion. Thank you. This is an interesting question, and we've had this asked many times, Raziel, probably for you. Is it possible for models used by different apps on iPhone to be shared across apps? This could help save storage space for users. Great question. No, it's not possible to share because it's very complicated, right? Like, you may think, oh, if we share a model, then we get to not pollute the, you know, very constrained resources that we have. But at the same time, it becomes harder to control who is using the resource when I'm trying to use it at the same time. Right. Having said that, for example, for model caching that we provide in Core AI that allows you to, you know, keep resources in memory, you can share the resource in the cache group if you have an application group, an app group. So that's something that is possible. But just, you know, coming and saying like, oh, there is one model that will serve multiple apps, it becomes really complicated.
The other reason is that, like we've been talking about, these models can be very nuanced in how you want to use them. And even, right, like what are the trade-offs that you're going to make, right? When you run a model, we typically optimize them or quantize them in a particular way. And there are some trade-offs involved in terms of quality, for example, or performance. So even if you are saying, okay, I want to use a Qwen model. Even if it's the same Qwen model, let's say a Qwen 6.6 billion, you might want to use floating point, because in your evaluation, that's what meets the needs of your use case. But somebody else may say, oh, no, I want to use the four-bit one because I want to buy some performance, and the evaluation tells me that that's fine for my use case. So the fine-grained nature of all these use-case-specific optimizations makes it tricky to do it that way as well. There's also sandboxing on the, I guess, apps can't access. Exactly. So even download. Because the thing you can do on the Mac, which I'm not even sure you can do in a Mac app, but you can do it with Python, for instance: you download, and you put it in a shared cache, and then some other app tries to download. And it's like, oh, I found it in the cache. I can use it. That's fundamentally something you can't do on the iPhone for security reasons and other reasons. So it's actually extremely hard to share models across apps. So just for my own clarification there, the model weights, you can put them into a shared app container group, and the weights, you can just share the download, but the runtime, the loading of them in memory, doesn't get shared, right? That's what Raziel was talking about, but I think even completely different apps can't really share the weights either, right? The weights can be shared if the apps are owned by the same developer. Yeah, that's what I mean. If they're not, then they can't.
But it would be great in a sense. One can imagine, but it needs to be something way bigger than just a download. One could imagine, oh, I know I have the hash. The weights are correct. Nobody's tampered with them. They're there. But it needs to be a service or something bigger than just, oh, there's a download somewhere on the disk. It's much more complicated. OK, we're out of time. But I think this discussion brings us all the way back to the start, where we said if it's foundation models, the model is in the OS. It's not a dependency of your app. Try that first. It doesn't contribute to your app size. And there's PCC as an option. Only if you want to bring in your own model, you can use Core AI, MLX. Of course, use the evaluations framework for testing, and if you need help with development, you can use the agentic coding capabilities in Xcode. So thank you. So thanks, everyone. That brings us to the end of today's group lab. Thanks for joining us. We hope this was useful. I also want to say thank you to our panelists here, as well as all the folks working hard behind the scenes to help this happen today. So as we mentioned earlier, if we didn't get to your questions, please visit developer.apple.com/forums where we can continue this discussion. And if you have bug reports or feature requests, head over to feedbackassistant.apple.com and share them with us there. Speaking of feedback, you should also receive an email with a survey link to let us know about your experience at WWDC. We would love to incorporate your feedback for future events. So thanks again for joining us and hope you have a great WWDC. Namaskara. Thank you. Thank you. Thank you.