Text-to-3D with OpenAI Shap-E

Shap-E is OpenAI’s open-source model that generates 3D objects conditioned on text or images. It outputs implicit representations that you can export to standard mesh formats like OBJ and PLY. The fastest way to use it is through the Hugging Face diffusers library, which wraps the model into a familiar pipeline interface.

```bash
pip install diffusers transformers torch trimesh
```
```python
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_ply

# Load the text-to-3D pipeline
pipe = ShapEPipeline.from_pretrained(
    "openai/shap-e",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

prompt = "a medieval wooden chair"
output = pipe(
    prompt,
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=256,
    output_type="mesh",  # decode to a mesh; the default "pil" returns rendered images
).images

# Export the first result to PLY format
export_to_ply(output[0], "chair.ply")
print("Saved chair.ply")
```

This produces a vertex-colored mesh in PLY format. To convert the PLY to OBJ, use trimesh (note that conversion changes only the container format, not the geometry):

```python
import trimesh

mesh = trimesh.load("chair.ply")
mesh.export("chair.obj")
```

Shap-E needs around 7-8 GB of VRAM. If you hit torch.cuda.OutOfMemoryError: CUDA out of memory, drop the frame_size to 128 or move to CPU with pipe.to("cpu"), reloading the weights in float32 first since fp16 kernels are poorly supported on CPU. Generation will then take minutes instead of seconds, but it works. Another common issue: the progress bar sticks at 0% and never advances. That usually means your PyTorch build doesn’t match your CUDA driver version. Run python -c "import torch; print(torch.cuda.is_available())" to confirm GPU access before you blame the model.
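That retry ladder can be wrapped in a small helper. A sketch, assuming the pipe object from the example above; generate_with_fallback is our name, not a diffusers API:

```python
import torch

def generate_with_fallback(pipe, prompt):
    """Try full-size GPU generation, then a smaller frame size, then CPU."""
    attempts = [
        ("cuda", 256),  # full quality
        ("cuda", 128),  # halve frame_size on OOM
        ("cpu", 128),   # last resort: slow but reliable
    ]
    for device, frame_size in attempts:
        if device == "cuda" and not torch.cuda.is_available():
            continue
        try:
            pipe.to(device)
            return pipe(
                prompt,
                guidance_scale=15.0,
                num_inference_steps=64,
                frame_size=frame_size,
                output_type="mesh",
            ).images
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
    raise RuntimeError("all generation attempts failed")
```

On CPU you may also need to reload the pipeline in float32, as noted above.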

Image-to-3D with TripoSR

TripoSR, built by Stability AI and Tripo AI, reconstructs a textured 3D mesh from a single photograph. Where Shap-E generates from a text description, TripoSR excels at turning a real photo into geometry – it produces results in under a second on an A100.

```bash
git clone https://github.com/VAST-AI-Research/TripoSR.git
cd TripoSR
pip install -r requirements.txt
```

Run it from the command line:

```bash
python run.py examples/chair.png --output-dir output/
```

This writes an OBJ file with vertex colors to output/. You can also specify the marching cubes resolution to trade mesh density against speed:

```bash
python run.py examples/chair.png \
    --output-dir output/ \
    --mc-resolution 256 \
    --no-remove-bg
```

The --no-remove-bg flag skips background removal when your input already has a clean white or transparent background. Passing a cluttered photo without background removal produces garbage geometry, so only use this flag when you know the subject is already isolated.
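If you have a transparent-background PNG and want the plain white backdrop instead, flattening it first is a safe preprocessing step. A minimal PIL sketch (flatten_to_white is our name; it uses a synthetic image so the snippet runs standalone):

```python
from PIL import Image

def flatten_to_white(rgba: Image.Image) -> Image.Image:
    """Composite a transparent image onto a plain white background."""
    background = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    return Image.alpha_composite(background, rgba.convert("RGBA")).convert("RGB")

# Example: a fully transparent 64x64 image flattens to pure white;
# in practice you would call Image.open("cutout.png") here.
flat = flatten_to_white(Image.new("RGBA", (64, 64), (0, 0, 0, 0)))
print(flat.getpixel((0, 0)))
```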

Fixing torchmcubes Build Failures

The most common TripoSR installation problem is torchmcubes failing to compile. You’ll see something like:

```
ERROR: Failed building wheel for torchmcubes
```

or at runtime:

```
AttributeError: module 'torchmcubes_module' has no attribute 'mcubes_cuda'
```

Fix it by making sure your local CUDA toolkit version matches the CUDA version that PyTorch was compiled with. Check both:

```bash
nvcc --version          # local CUDA toolkit
python -c "import torch; print(torch.version.cuda)"  # PyTorch's CUDA
```

If they differ, either install the matching CUDA toolkit or reinstall PyTorch for your local version. Then rebuild torchmcubes:

```bash
pip install --upgrade setuptools
pip uninstall torchmcubes -y
pip install git+https://github.com/tatsy/torchmcubes.git
```

At default settings, TripoSR uses around 6 GB of VRAM. It can run on CPU too, though single-image inference jumps from under a second to roughly 30 seconds.

Using the Meshy API for Production Workflows

Open-source models are great for experimentation, but if you need reliable 3D generation at scale without managing GPU infrastructure, the Meshy API is a solid option. It supports both text-to-3D and image-to-3D, exports to GLB, OBJ, FBX, USDZ, and STL, and handles the compute for you.

The workflow is asynchronous: submit a task, poll for completion, then download the result.

```python
import os
import time
import requests

API_KEY = os.environ["MESHY_API_KEY"]
BASE_URL = "https://api.meshy.ai/openapi/v2"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: Create a preview task
task = requests.post(
    f"{BASE_URL}/text-to-3d",
    headers=HEADERS,
    json={
        "mode": "preview",
        "prompt": "a sci-fi helmet with glowing blue visor",
        "negative_prompt": "low quality, blurry",
    },
).json()
task_id = task["result"]
print(f"Task created: {task_id}")

# Step 2: Poll until complete
while True:
    status = requests.get(
        f"{BASE_URL}/text-to-3d/{task_id}",
        headers=HEADERS,
    ).json()
    if status["status"] == "SUCCEEDED":
        break
    if status["status"] == "FAILED":
        raise RuntimeError(f"Task failed: {status.get('message')}")
    print(f"Status: {status['status']} — waiting...")
    time.sleep(5)

# Step 3: Download the GLB file
glb_url = status["model_urls"]["glb"]
glb_data = requests.get(glb_url).content
with open("helmet.glb", "wb") as f:
    f.write(glb_data)
print(f"Downloaded helmet.glb ({len(glb_data)} bytes)")
```

The preview stage generates a low-detail mesh quickly. For production assets, follow it with a refine step by sending another request with "mode": "refine" and "preview_task_id": task_id. The refined model includes proper textures and cleaner geometry.
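A sketch of that refine call, reusing BASE_URL and HEADERS from the example above. The field names mirror the preview request; check the Meshy API reference for the full schema:

```python
import requests

BASE_URL = "https://api.meshy.ai/openapi/v2"

def build_refine_payload(preview_task_id: str) -> dict:
    """Payload for the refine stage, referencing a finished preview task."""
    return {
        "mode": "refine",
        "preview_task_id": preview_task_id,
    }

def start_refine(headers: dict, preview_task_id: str) -> str:
    """Submit the refine task; returns a new task id to poll as before."""
    resp = requests.post(
        f"{BASE_URL}/text-to-3d",
        headers=headers,
        json=build_refine_payload(preview_task_id),
    )
    resp.raise_for_status()
    return resp.json()["result"]
```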

Meshy also supports image-to-3D through a separate endpoint at /openapi/v2/image-to-3d. Pass an image URL or base64-encoded image and you get the same async task workflow back.
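Since both endpoints return the same async task shape, the polling loop is worth factoring into a helper with a timeout, so a stuck task cannot hang your pipeline forever. A sketch; poll_task is our name, and the status values follow the example above:

```python
import time
import requests

def poll_task(url: str, headers: dict, interval: float = 5.0,
              timeout: float = 600.0, fetch=requests.get) -> dict:
    """Poll a Meshy task URL until it succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(url, headers=headers).json()
        if status["status"] == "SUCCEEDED":
            return status
        if status["status"] == "FAILED":
            raise RuntimeError(f"Task failed: {status.get('message')}")
        time.sleep(interval)
    raise TimeoutError(f"Task did not finish within {timeout}s")
```

The fetch parameter exists so you can inject a stub in tests instead of hitting the network.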

Picking the Right Tool

Each approach fills a different niche:

| Tool | Input | Output | Speed | GPU required |
| --- | --- | --- | --- | --- |
| Shap-E | Text or image | PLY, OBJ | ~30 s on GPU | Yes (7-8 GB) |
| TripoSR | Single image | OBJ with vertex colors | <1 s on A100 | Recommended (6 GB) |
| Meshy API | Text or image | GLB, OBJ, FBX, USDZ, STL | 1-3 min | No (cloud) |

If you want quick experiments from text descriptions, start with Shap-E. If you have product photos or concept art and need geometry fast, TripoSR is hard to beat. If you’re building a pipeline where reliability and format flexibility matter more than cost, Meshy handles the infrastructure so you don’t have to.

Output Formats and Viewing Your Models

All three tools can produce OBJ files, which nearly every 3D application reads. GLB (binary glTF) is the better choice for web viewers and game engines since it bundles geometry, textures, and materials into a single file.

To quickly inspect a generated mesh without opening Blender:

```bash
pip install trimesh pyglet
python -c "import trimesh; trimesh.load('helmet.glb').show()"
```

For web embedding, three.js loads GLB files natively with its GLTFLoader. Drop the file on gltf-viewer.donmccurdy.com for a fast sanity check before integrating into your pipeline.