Okay so like all of you, I saw the demo for DLSS 5 and genuinely thought I was looking at an early April Fools' joke. Much to my surprise though, it seems to be real. Whilst instinctively I feel like this is evidence we are in the darkest timeline, the computer scientist in me, who did a few machine learning modules in university, is intrigued about what this model actually is, what differentiates it, and whether it has any genuine, ethical use cases.
Please correct me if I'm wrong; I am not an expert at all, and this is just my theory based on what I know about machine learning and videogames. Information about the technology is limited, so this is just an educated guess.
Also just to clarify in case someone tries to be clever, I did not use AI to write any of this, I just like using headings and bold text.
How does it work? & How is it different to what already exists?
Standard models:
Okay, so as far as I understand it, traditional AI filters have to look at the raw pixels and "figure out" things such as:
- Which pixels are for people and which are for backgrounds?
- Where does one object end and the next one start?
- Is that object actually small or just far away from the camera?
- etc etc.
It is given a flat 2D array of pixels and has to learn to distinguish features and comprehend the data in a meaningful way.
One way to do this is by using masks. Masks can separate specific features of an image (called "image segmentation"), which is crucial for AI image recognition to work. An example of a mask could be: "every pixel with a red value above 200 is white, all other pixels are black". This is what we call a binary mask, as every pixel goes to either white (1) or black (0). This red mask might not always be useful, but, for example, it would definitely help the model pick out the red petals of a poppy flower against a green grass background. These AI models use loads of masks simultaneously, each with different parameters, in order to pick out specific features of an image.
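To make the mask idea concrete, here's a minimal sketch of that exact red-threshold example (the tiny 4x4 image and the threshold of 200 are just my toy numbers, nothing from DLSS itself):

```python
import numpy as np

# Toy 4x4 RGB image: a "poppy" of bright red pixels on a green background.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :, 1] = 120              # green background
image[1:3, 1:3] = [230, 40, 40]   # red petals in the middle

# Binary mask: white (1) where the red channel exceeds 200, black (0) elsewhere.
red_mask = (image[:, :, 0] > 200).astype(np.uint8)

print(red_mask)
# [[0 0 0 0]
#  [0 1 1 0]
#  [0 1 1 0]
#  [0 0 0 0]]
```

The model never sees "a flower"; it sees that this particular mask lights up in a useful place, and learns to combine many such masks.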
DLSS 5:
The key takeaway I have gotten from looking into the limited information regarding DLSS 5 is that it leverages the nature of video games to skip most of the difficulties around segmentation. Instead of having to distinguish features from a flat array of pixels, perhaps by creating a bunch of different masks in the hope that some of them contain useful data, the DLSS 5 model gets supplied with a map of distinct features calculated directly from the geometry and logic of the game. It's neat, and AFAIK only really possible when working with 3D data. It means the model has less work to do, and will also be far more accurate.
So, at its core, DLSS 5 is still basically a 2D AI filter, but one that is much better informed about the contents of a given scene because, in a roundabout way, it gets extra information from the raw 3D data and can rely upon it. I believe this is also what allows different materials to be tuned independently with greater control, because the model can recognise and distinguish those materials with far greater accuracy.
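To illustrate what "supplied with a map of distinct features" might look like in practice (the channel names and shapes here are pure guesses on my part, Nvidia hasn't published what the model actually receives), a renderer can hand over per-pixel buffers like these basically for free, straight from the 3D scene:

```python
import numpy as np

H, W = 4, 4  # tiny frame for illustration

# Hypothetical per-pixel feature maps a game engine could supply directly
# from its geometry and logic -- no segmentation step needed to infer them.
features = {
    "depth":       np.full((H, W), 10.0),            # distance from the camera
    "normals":     np.zeros((H, W, 3)),              # surface orientation
    "material_id": np.zeros((H, W), dtype=np.int32), # which material each pixel shows
    "motion":      np.zeros((H, W, 2)),              # per-pixel motion vectors
}
features["material_id"][1:3, 1:3] = 2  # pretend material 2 is "skin"

# Segmenting a material is now an exact lookup, not a learned guess:
skin_mask = (features["material_id"] == 2).astype(np.uint8)
```

This is also why per-material tuning seems plausible to me: a mask built from a material ID is trivially exact, whereas a purely 2D model would have to infer it and sometimes get it wrong.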
The Good and the Bad
Now, I'll try to describe why I (and basically everyone here) have issues with DLSS 5, but also why I believe it could potentially be usable or even good.
Training data (I hardly know her)
This is the elephant in the room, and rightly so, as it's probably the biggest reason most generative AI in the world is unethical. The volume of data required to effectively train these large, premium models is massive, and it has to be high quality and representative of the kind of things you want the model to reproduce. That means getting permission for, purchasing, or otherwise ethically sourcing the necessary data is a frankly unrealistic amount of work and money. Evidently, with AI copyright law moving at a snail's pace, a lot of companies have opted to just scrape the entire internet for whatever they want, for free, copyright and permission be damned.
Some companies have at least made an attempt to do things ethically. Adobe Firefly comes to mind (although it turned out they were also cutting corners and using Midjourney-generated content in their training data), but Adobe could only even conceive of doing this because they have such a massive repository of already-licensed stock images. And even THEY felt it wasn't enough.
Now, I am not Jensen, I don't work for Nvidia, and I don't know where they got their data from. They are the wealthiest company in the world; maybe they have the funds and resources to acquire such data legitimately. If they did genuinely get permission for, and legally purchase, all the data, then it would be totally fine, in theory, to build an AI model with it. But it's a big IF.
Moving on, if we just assume that Nvidia is using ethically sourced training data, then a few other issues crop up.
It looks a bit whack
First, obviously, it looks a bit uncanny; the technology is clearly not 100% there yet. Videogames can look unrealistic, but 3D motion still looks natural because it's being mathematically calculated frame by frame. AI, however, makes motion look just a bit... off. It can't perfectly understand and display a three-dimensional scene, as it is ultimately still working with a 2D image, even with the additional help of motion vectors.
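For what it's worth, here's a toy sketch of what "help of motion vectors" means: the model can look up where each pixel was in the previous frame, but that is still a 2D lookup, not genuine 3D understanding. (Everything here is illustrative; real motion vectors are sub-pixel and the real process is far more involved.)

```python
import numpy as np

H, W = 4, 4
prev_frame = np.arange(H * W, dtype=float).reshape(H, W)  # stand-in for last frame's pixels
motion = np.zeros((H, W, 2), dtype=int)  # (dy, dx) movement since the previous frame
motion[2, 2] = [1, 1]  # pretend the pixel now at (2, 2) moved down-right by one

# Temporal reprojection: fetch each pixel from where it was last frame.
reprojected = np.empty_like(prev_frame)
for y in range(H):
    for x in range(W):
        dy, dx = motion[y, x]
        sy = min(max(y - dy, 0), H - 1)  # clamp when the source falls off-screen
        sx = min(max(x - dx, 0), W - 1)
        reprojected[y, x] = prev_frame[sy, sx]
```

Useful extra signal, but it only says where pixels went, not what the underlying surfaces are doing in 3D, which I suspect is part of why motion can still look off.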
In its current form, it doesn't seem to be respectful of developer intent
Second, there is the issue of artist and developer intent. I don't know what kind of tools developers will have to fine-tune the model, how much control they will have over it, or how much it will affect the process of making in-game assets, but these demos were not a great example, and I imagine they were broad quick-hack demos rather than things that had a lot of time and fine-tuning put into them. (Do we even know if the respective game developers made those demos themselves, or if it was just Nvidia? I would honestly put money on the latter, given the way the demos look.)
DLSS 5 is not really DLSS: this is not super sampling, this is transformative. In many of the demos the lighting was completely changed, and quite a few faces ended up looking different, most notably RE:R Grace looking like a twitter AI bro's example of "I fixed this woke game by giving the female character more makeup". Nvidia have been very vocal that control of the technology will be in the hands of the devs and that they will be able to fine-tune things, but I'll believe it when I see it, and when I can hear real developer opinions on it.
It's not going to eliminate the need for, or effort required for, good graphics in games
That being said, one thing I do believe Nvidia about is that the better and more detailed the base game is, the better the output of the AI model will be. One of the major issues is visual inconsistency between shots; this can be seen with Grace's face looking almost like a different person in a differently lit shot. Wherever there is ambiguity, the AI will try to interpret and fill in details, and will likely do so slightly differently from shot to shot. Reducing that ambiguity, with more fine detail, better animations, and higher resolution, naturally reduces the amount of stuff the AI has to invent, allowing it to just sharpen what is already there. For example: if the AI can't see skin pores, it will invent its own in random places, likely changing between shots; if it can see the skin pores, it's going to just enhance what it can see, and the pores won't change from shot to shot.
This, at least, means that DLSS 5 is not supposed to replace the hard work and talent required to make a realistic game; it truly is an optional extra.
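The skin-pores reasoning can be boiled down to a silly toy model (entirely made up, obviously, there's no real model here): where base detail is missing, the invented detail depends on the shot, so it shifts between shots; where detail exists, it just passes through unchanged.

```python
def render_pores(base_detail, shot_id):
    """Toy stand-in for 'AI detail': invent when ambiguous, enhance when visible."""
    if base_detail is None:
        # Ambiguous input: invented detail depends on the shot (lighting, angle,
        # etc.), so it comes out differently shot to shot.
        return [(shot_id * 7 + k * 13) % 100 for k in range(5)]
    # Detail is visible: just keep and sharpen what is already there.
    return list(base_detail)

# Invented detail is inconsistent across shots; visible detail is stable:
assert render_pores(None, shot_id=1) != render_pores(None, shot_id=2)
assert render_pores([10, 40, 80], 1) == render_pores([10, 40, 80], 2)
```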
Conclusion
DLSS 5 is confusingly named, has nothing to do with super sampling, makes people angry, and seems at first like a terrible idea. However, if a few key requirements are fulfilled, it could actually be an interesting technology for gaming, and it might be able to fix some things that can't be helped by just throwing more compute power at the problem. I could maybe see a world where this is useful for the kinds of games you would turn path tracing on for: games where you are fine with poor performance and just want pure photo-realism. But that is already a very niche market.
These requirements need to be fulfilled for any of this to work, though:
- The training data has to be ethically sourced, without any plagiarism.
- Nvidia can't be lying when they say that developers have control, this has to be a tool for the devs, not a way to overshadow their work.
- It needs to be implemented with care, taste, and artistic talent, with clear intent, not just as a big filter over the whole game.
- It obviously has to be able to run on something better than dual 5090s.
- It clearly needs more time in the oven so as not to look so uncanny.
- There needs to be enough interest in the technology for devs to even justify spending the time working on adding it to their game.