Exploring Gaussian Splatting
Turning ordinary videos into 3D scenes you can walk around inside, with nothing but a phone and a regular computer. What I've explored so far, what tripped me up, and why I think this gets weird in a good way.
A Gaussian splat is the first 3D format that has felt like magic to me since I first rotated a cube in WebGL. You take a video, walk it through some software, and what comes out the other side is a scene you can fly through. Not a mesh, not a photo, not a video. The actual room, reconstructed, that you can look around inside.
Here is one I trained. It started life as a walkthrough video of a farmhouse. Load it and drag to look around. On a laptop, use WASD to move; on a phone, there’s a joystick.
[Interactive viewer: Farmhouse interior, 3D Gaussian splat]
Here’s what I still can’t get over: I didn’t film any of that. I never set foot in that farmhouse. It’s reconstructed from three and a half minutes of a house tour someone posted to YouTube. They walked through a property with a gimbal, uploaded it, and that footage was enough to rebuild the place as a space you can move around inside. This is the exact clip, cued to the section I fed in.
Sit with that for a second. A normal video off the internet, shot by someone who had no idea it would be used this way, and a bit of free software turns it into a place you can move around in. The bottleneck for capturing somewhere in three dimensions used to be a rig and a reason to be there. Now it’s whether a usable video exists.
What you’re looking at is roughly 730,000 tiny 3D blobs. Each one has a position, a color, a size, and a bit of transparency. Stack enough of them and blend them together and the fuzz turns into what reads as a solid room. The clever part is that nobody places them by hand. The software starts with a rough scattering of dots and keeps nudging every blob, over and over, until the image it produces matches the original video. It teaches itself the scene.
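As a rough illustration of "stack enough of them and blend them together", here is a minimal sketch of how the blobs covering a single pixel could be combined, nearest first. The `Blob` fields and the `blend_front_to_back` function are hypothetical names made up for this example, not the format or code of any particular tool. A real renderer also fades each blob's opacity with distance from its centre, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    """One splat, reduced to the four properties named above (hypothetical layout)."""
    position: tuple[float, float, float]  # centre of the blob in the scene
    color: tuple[float, float, float]     # RGB, each channel in 0..1
    size: float                           # radius; real splats store a per-axis scale and a rotation
    opacity: float                        # 0 = invisible, 1 = fully opaque


def blend_front_to_back(blobs_near_to_far, background=(0.0, 0.0, 0.0)):
    """Composite the blobs covering one pixel, nearest blob first."""
    pixel = [0.0, 0.0, 0.0]
    remaining = 1.0  # fraction of light not yet blocked by nearer blobs
    for blob in blobs_near_to_far:
        weight = remaining * blob.opacity
        for channel in range(3):
            pixel[channel] += weight * blob.color[channel]
        remaining *= 1.0 - blob.opacity
    # Whatever light is left over shows the background.
    for channel in range(3):
        pixel[channel] += remaining * background[channel]
    return tuple(pixel)


# A half-transparent red blob in front of an opaque blue one.
red = Blob((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), 0.1, 0.5)
blue = Blob((0.0, 0.0, 2.0), (0.0, 0.0, 1.0), 0.1, 1.0)
print(blend_front_to_back([red, blue]))  # (0.5, 0.0, 0.5)
```

Training amounts to adjusting those fields for every blob until pixels computed this way match the frames of the video.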
That’s the part that hooked me. I’ve been reconstructing rooms, objects, whole spaces from nothing but a video, and the results are good enough that I keep showing people and watching them tilt their head. None of it needed special equipment. I’ve made these on the gaming PC under my desk and even on my laptop. No lab, no scanner, no rig.
From a phone video to a scene
Start to finish it’s four steps, and once I’d scripted it the whole thing runs while I make coffee.
- Film it. Walk through the space slowly and steadily. I’ve mostly used my phone, in 4K. Slow and continuous matters more than long, for reasons I’ll get to.
- Chop it into stills. The video gets sliced into still images, a couple of frames per second.
- Work out the camera path. The software studies all those stills and figures out where the camera was standing for each one. This is the step that decides whether you get a scene or a mess.
- Build the scene. From the stills and the camera path, it grows the blobs until the room appears. About 25 minutes on my computer.
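The second step is the easiest one to script. The sketch below only builds the command line for it, assuming ffmpeg is the tool doing the slicing. That is an assumption for illustration: the post does not name the tools behind any of the four steps. The function names, the file names, and the rate of exactly 2 frames per second are placeholders too.

```python
from pathlib import Path


def frame_extraction_command(video, out_dir, fps=2.0):
    """Build an ffmpeg argument list that saves `fps` stills per second of video."""
    pattern = Path(out_dir) / "frame_%05d.jpg"  # frame_00001.jpg, frame_00002.jpg, ...
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"fps={fps:g}",  # resample the video to the chosen frame rate
        "-qscale:v", "2",       # keep JPEG quality high
        str(pattern),
    ]


def approx_frame_count(duration_seconds, fps=2.0):
    """Roughly how many stills a clip of this length produces."""
    return round(duration_seconds * fps)


print(frame_extraction_command("tour.mp4", "frames"))
print(approx_frame_count(94))  # 188 stills for a 94-second clip at 2 fps
```

To actually run it, create the output folder first and pass the list to `subprocess.run(cmd, check=True)`. The camera-path and scene-building steps depend on whichever reconstruction and training tools are in use, so they are left out here.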
The best result I’ve gotten was the most hobbyist thing imaginable: a tiny 1/64 scale model of a 7-Eleven, the kind of diorama people build for fun. I shot 94 seconds of it on my phone, orbiting slowly. Every frame was usable, and the whole thing, video to finished scene, took 35 minutes and came out sharper than any of the room-sized captures I’d done. Small, controlled, well lit, shot with patience. The diorama taught me more about good filming than any of the big scenes did.
The raw output is huge: the bigger scenes come out at several hundred megabytes, a few million blobs. To put one in a browser you compress it, and the viewer above is the same farmhouse squeezed from a couple hundred megabytes down to 13. Even that has its quirks. My graphics card refuses to do the compression at full color detail and crashes unless I dial it back, the kind of thing you only ever learn by smacking straight into it.
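Those sizes are easy to sanity-check with a back-of-the-envelope sketch. It assumes the per-blob layout of the PLY files written by the original 3D Gaussian Splatting reference code: position, three unused normal values, spherical-harmonic color coefficients, opacity, scale, and a rotation quaternion, all stored as 32-bit floats. Whether the files described here use exactly that layout is an assumption, and so is the idea that compression keeps the blob count unchanged.

```python
def floats_per_blob(sh_degree):
    """Floats stored per blob in the assumed PLY layout."""
    color = 3 * (sh_degree + 1) ** 2  # spherical-harmonic coefficients for R, G and B
    # position + normals + color + opacity + scale + rotation quaternion
    return 3 + 3 + color + 1 + 3 + 4


def raw_megabytes(blob_count, sh_degree=3, bytes_per_float=4):
    """Uncompressed size in megabytes (1 MB = 1,000,000 bytes)."""
    return blob_count * floats_per_blob(sh_degree) * bytes_per_float / 1e6


blobs = 730_000
print(floats_per_blob(3))               # 62 floats, i.e. 248 bytes per blob
print(round(raw_megabytes(blobs), 1))   # 181.0 MB uncompressed
print(round(13e6 / blobs, 1))           # 17.8 bytes per blob in a 13 MB file
```

Under those assumptions, 730,000 blobs come to about 181 MB raw, in line with "a couple hundred megabytes", and the 13 MB version has to describe each blob in roughly 18 bytes. Most of the raw size is color: 48 of the 62 floats are spherical-harmonic coefficients.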
The floaters
Now the part that ate most of my time.
Train a scene and the first thing you notice, after the initial “it worked”, is the junk. Faint translucent sheets hanging in mid-air. Spikes coming off shiny surfaces. A haze where a window should be. These are floaters: blobs the software put in the wrong place, usually because it wasn’t confident where the camera had been looking, so it stranded them in mid-air and never cleaned them up.
The maddening thing is how few of them there are. In one scene I dug into, the floaters were about one percent of all the blobs. But a single floater can be stretched ten meters across, so that one percent can smear over most of what you actually see. In that same scene, around 62,000 blobs had been flung so far outside the room that the whole thing measured 765 meters across. A farmhouse, in a box the size of a few city blocks, almost all of it empty except for a sprinkle of garbage.
I tried things that didn’t work, which is the useful part.
Masking the floaters away while it trained. The idea: paint over the parts of each frame I didn’t want (people walking through, mostly) and tell the software to treat them as empty. I pushed that too hard and it collapsed the entire scene to four blobs. Four, from 1.5 million. Obvious in hindsight: a person stands in a different spot in every frame, so “ignore this” turned into a pile of contradictory instructions, and the least-bad answer the software could find was to erase almost everything. I dropped the masking.
A turntable. For the diorama, my first instinct was to put it on a lazy-Susan, spin it, and hold the camera still. Clean, repeatable. It produced an empty room with a smear in the middle. The software expects the world to hold still while the camera moves around it. I’d done the exact opposite, a still room and a spinning object, so it latched onto the background and treated the thing I actually cared about as noise. The fix looks identical and behaves completely differently: leave the object still and walk the camera around it.
That failure points at the single biggest lever in the whole thing, which is filming, not the software. I have two takes of the same diorama. Take A was longer, three minutes, but I’d swung the camera around too fast, and the blur meant the software couldn’t connect the frames, so the scene came out in broken fragments. Take B was shorter, two and a half minutes, but slow and steady, and it came together cleanly with twenty times the usable detail. Slower and continuous beats longer with gaps, every time. You’re not filming for a person to watch, you’re filming for software to make sense of.
What did work was cleaning up from both ends. After a scene finishes, I run a quick pass that throws out anything obviously wrong: blobs sitting way outside the room, or stretched implausibly large. That alone shrank the 765-meter mess back down to the size of an actual house and deleted about 66,000 pieces of junk in a few seconds, without touching the real room. And while it trains, telling the software to be less trigger-happy about spawning blobs in the first place leaves me with a scene roughly half the size and just as sharp, which loads faster too.
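The after-the-fact pass can be sketched in a few lines. This is a minimal illustration of the idea, not the actual cleanup script: the `Blob` fields, the function names, and both thresholds are hypothetical, and the numbers would need tuning per scene. It uses the median position as the scene centre, since a median barely moves when a small fraction of blobs sit hundreds of meters away.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Blob:
    position: tuple[float, float, float]  # centre, in meters
    max_extent: float                     # length of the blob's longest axis, in meters


def prune_outliers(blobs, max_distance=15.0, max_size=2.0):
    """Drop blobs that sit far from the scene centre or are implausibly large.

    Both thresholds are placeholder values, not recommendations.
    """
    if not blobs:
        return []
    centre = tuple(median(b.position[axis] for b in blobs) for axis in range(3))
    kept = []
    for blob in blobs:
        distance = sum((blob.position[a] - centre[a]) ** 2 for a in range(3)) ** 0.5
        if distance <= max_distance and blob.max_extent <= max_size:
            kept.append(blob)
    return kept


def scene_width(blobs):
    """Longest side of the axis-aligned box that contains every blob centre."""
    return max(
        max(b.position[a] for b in blobs) - min(b.position[a] for b in blobs)
        for a in range(3)
    )


# A 10 x 10 grid of small blobs standing in for a room, plus two pieces of junk.
room = [Blob((float(x), float(y), 0.0), 0.05) for x in range(10) for y in range(10)]
junk = [Blob((700.0, 0.0, 0.0), 0.05), Blob((5.0, 5.0, 0.0), 10.0)]
scene = room + junk

cleaned = prune_outliers(scene)
print(scene_width(scene), scene_width(cleaned))  # 700.0 9.0
print(len(scene) - len(cleaned))                 # 2 blobs removed
```

Two stray blobs are enough to inflate the toy scene from 9 meters to 700, which is the same effect as the 765-meter farmhouse.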
What I still can’t fully beat is reflections. A shiny sink or a mirror throws off spikes that aren’t out in mid-air and aren’t oversized, so none of my cleanup tricks catch them. That one’s still open.
Where this goes
Everything above produces a frozen moment. The farmhouse you’re looking at is one instant, reconstructed. Nothing in it moves.
The frontier is making it move: 4D splatting, where the Gaussians have a time dimension and the scene plays back like volumetric video you can walk around inside. The research is real and moving fast. The catch is capture. To reconstruct a moving scene from a single camera you’d have to be in every place at once, so the groups doing this convincingly build rooms lined with dozens of synchronized cameras firing together, capturing every angle at every instant. That works, and it is completely out of reach for someone with a phone.
But so was the static version, a few years ago. The static reconstruction I’m doing on a gaming GPU today needed a lab with a camera rig not long before that. The interesting question isn’t whether 4D-from-a-phone arrives, it’s what I’ll do with it the week it does. The thing I keep coming back to is memory. A photo of a place you loved is a flat window. A splat is the room. A temporal splat would be the afternoon, the light moving, the people in it, something you could step back into instead of look at.
I don’t think that’s far off. And when it lands, it’ll land on the kind of hardware people already own, in the hands of whoever’s been paying attention.